Re: Problems with treebuild -- setting the TREE_NUM_BEFORE_NODESPLIT

From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Date: Sun, 29 Aug 2021 18:56:17 +0200

Hi Weiguang,

The tree construction problem in subfind is odd and still bothers me. Could you perhaps make the run available to me on cosma7 so that I can investigate this myself?

I agree that there should be enough total memory for FMM, but the termination of the code looks to be caused by an insufficient size allocation of internal bookkeeping buffers related to the communication parts of the algorithm. If you're up for it, you could also make this setup available to me; then I can take a look at why this happens.
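
The sizing pattern at issue is essentially the following (a condensed sketch of the idea only, not the literal fmm.cc code; the floor value here is made up for illustration):

```
// Communication workstacks are sized in proportion to the local plus
// imported particle count, with a fixed lower bound. If the prefactor is
// too small for a strongly clustered work distribution, the stack fills
// up before even a single particle can be processed -- which is the
// error message you saw.
#include <algorithm>

constexpr int kMinWorkstackSize = 100000;  // illustrative floor only

int workstack_size(int num_local, int num_imported, double prefactor)
{
  return std::max<int>(prefactor * (num_local + num_imported), kMinWorkstackSize);
}
```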

Regards,
Volker

> On 24. Aug 2021, at 12:29, Weiguang Cui <cuiweiguang_at_gmail.com> wrote:
>
> Hi Volker,
>
> This is a pure dark-matter run. The error occurred when the simulation had reached z~0.3.
> As you can see from the attached config options, this simulation used an old IC file, and double-precision output was not enabled.
>
> I increased the factor from 0.1 to 0.5, which still resulted in the same error in fmm.cc. I don't think memory is the issue here. As shown in memory.txt, the largest allocation reported anywhere in the file is
> ```MEMORY: Largest Allocation = 11263.9 Mbyte | Largest Allocation Without Generic = 11263.9 Mbyte```, and the parameter ```MaxMemSize 18000 % in MByte``` is in agreement with the machine's memory (cosma7). I will increase the factor to an even higher value to see whether that helps.
>
> If single-precision positions are not the issue, could the error be caused by the `FoFGravTree.treebuild(num, d);` or `FoFGravTree.treebuild(num_removed, dremoved);` calls in subfind_unbind, where an FoF group may have too many particles in a very small volume to build the tree?
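>
> If it helps to test this hypothesis, a stand-alone check along the following lines (my own sketch, assuming the group's positions are available as 64-bit integer triples) would tell whether more than TREE_NUM_BEFORE_NODESPLIT particles really sit on identical coordinates:
> ```
> // Hypothetical diagnostic: sort one group's integer positions and find
> // the longest run of exactly identical coordinate triples.
> #include <algorithm>
> #include <array>
> #include <cstdint>
> #include <vector>
>
> int max_coordinate_collisions(std::vector<std::array<std::uint64_t, 3>> pos)
> {
>   std::sort(pos.begin(), pos.end());  // lexicographic on (x, y, z)
>   int run = 1, worst = pos.empty() ? 0 : 1;
>   for(std::size_t i = 1; i < pos.size(); i++)
>     {
>       run   = (pos[i] == pos[i - 1]) ? run + 1 : 1;
>       worst = std::max(worst, run);
>     }
>   return worst;  // > TREE_NUM_BEFORE_NODESPLIT would reproduce the abort
> }
> ```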
>
> Any suggestions are welcome. Many thanks!
>
> ==================================
> ALLOW_HDF5_COMPRESSION
> ASMTH=1.2
> DOUBLEPRECISION=1
> DOUBLEPRECISION_FFTW
> FMM
> FOF
> FOF_GROUP_MIN_LEN=32
> FOF_LINKLENGTH=0.2
> FOF_PRIMARY_LINK_TYPES=2
> FOF_SECONDARY_LINK_TYPES=1+16+32
> GADGET2_HEADER
> IDS_64BIT
> LIGHTCONE
> LIGHTCONE_IMAGE_COMP_HSML_VELDISP
> LIGHTCONE_MASSMAPS
> LIGHTCONE_PARTICLES
> LIGHTCONE_PARTICLES_GROUPS
> MERGERTREE
> MULTIPOLE_ORDER=3
> NTAB=128
> NTYPES=6
> PERIODIC
> PMGRID=4096
> RANDOMIZE_DOMAINCENTER
> RCUT=4.5
> SELFGRAVITY
> SUBFIND
> SUBFIND_HBT
> TREE_NUM_BEFORE_NODESPLIT=64
> ===========================================================
>
>
> Best,
> Weiguang
>
> -------------------------------------------
> https://weiguangcui.github.io/
>
>
> On Mon, Aug 23, 2021 at 1:49 PM Volker Springel <vspringel_at_mpa-garching.mpg.de> wrote:
>
> Hi Weiguang,
>
> The code termination you experienced in the tree construction during subfind is quite puzzling to me, especially since you used BITS_FOR_POSITIONS=64... In principle, this situation should only arise if you have a small group of particles (~16) in a region about 10^18 times smaller than the boxsize. Has this situation occurred during a simulation run, or in postprocessing? If you have used single precision for storing positions in a snapshot file, or if you have dense blobs of gas with intense star formation, then you can get occasional coordinate collisions of two or several particles, but ~16 of them seems exceedingly unlikely. So I'm not sure what's really going on here. Have things actually worked when setting TREE_NUM_BEFORE_NODESPLIT=64?
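>
> For scale, a quick back-of-the-envelope (my own illustration, using the 200 Mpc/h box from this thread; not code from the tree module):
> ```
> // Each tree level halves the node side, so after n splits it is
> // BoxSize / 2^n; 2^60 is already ~1.15e18, i.e. the ~10^18 above.
> #include <cmath>
> #include <cstdio>
>
> int main()
> {
>   const double boxsize_mpc_h = 200.0;      // comoving box size in Mpc/h
>   const double m_per_mpc     = 3.0857e22;  // meters per Mpc
>   const double side = boxsize_mpc_h / std::pow(2.0, 60);
>   // ~1.7e-16 Mpc/h, i.e. only a few thousand comoving kilometers:
>   std::printf("node side ~ %.3g Mpc/h ~ %.3g km\n", side, side * m_per_mpc / 1e3);
>   return 0;
> }
> ```
> That is far below any physical scale in a dark-matter-only run, which is why coordinate collisions are the usual suspect.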
>
> The FMM termination, by contrast, is a memory issue. It should be possible to resolve it with a higher setting of MaxMemSize, or by enlarging the factor 0.1 in line 1745 of fmm.cc:
> MaxOnFetchStack = std::max<int>(0.1 * (Tp->NumPart + NumPartImported), TREE_MIN_WORKSTACK_SIZE);
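>
> For instance, the same line with a five times larger prefactor (0.5 is just an example value, not a recommendation):
> MaxOnFetchStack = std::max<int>(0.5 * (Tp->NumPart + NumPartImported), TREE_MIN_WORKSTACK_SIZE);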
>
> Best,
> Volker
>
>
> > On 21. Aug 2021, at 10:10, Weiguang Cui <cuiweiguang_at_gmail.com> wrote:
> >
> > Dear all,
> >
> > I recently ran into another problem with the 2048^3, 200 Mpc/h run.
> >
> > treebuild in SUBFIND requires a higher value for TREE_NUM_BEFORE_NODESPLIT:
> > ==========================================================
> > SUBFIND: We now execute a parallel version of SUBFIND.
> > SUBFIND: Previous subhalo catalogue had approximately a size 2.42768e+09, and the summed squared subhalo size was 8.42698e+16
> > SUBFIND: Number of FOF halos treated with collective SubFind algorithm = 1
> > SUBFIND: Number of processors used in different partitions for the collective SubFind code = 2
> > SUBFIND: (The adopted size-limit for the collective algorithm was 9631634 particles, for threshold size factor 0.6)
> > SUBFIND: The other 10021349 FOF halos are treated in parallel with serial code
> > SUBFIND: subfind_distribute_groups() took 0.044379 sec
> > SUBFIND: particle balance=1.10537
> > SUBFIND: subfind_exchange() took 30.2562 sec
> > SUBFIND: particle balance for processing=1
> > SUBFIND: root-task=0: Collectively doing halo 0 of length 10426033 on 2 processors.
> > SUBFIND: subdomain decomposition took 8.54527 sec
> > SUBFIND: serial subfind subdomain decomposition took 6.0162 sec
> > SUBFIND: root-task=0: total number of subhalo coll_candidates=1454
> > SUBFIND: root-task=0: number of subhalo candidates small enough to be done with one cpu: 1453. (Largest size 81455)
> > Code termination on task=0, function treebuild_insert_group_of_points(), file src/tree/tree.cc, line 489: It appears we have reached the bottom of the tree because there are more than TREE_NUM_BEFORE_NODESPLIT=16 particles in the smallest tree node representable for BITS_FOR_POSITIONS=64.
> > Either eliminate the particles at (nearly) indentical coordinates, increase the setting for TREE_NUM_BEFORE_NODESPLIT, or possibly enlarge BITS_FOR_POSITIONS if you have really not enough dynamic range
> > ==============================================
> >
> > But if I increase TREE_NUM_BEFORE_NODESPLIT to 64, FMM stops working:
> > =============================================================
> > Sync-Point 19835, Time: 0.750591, Redshift: 0.332284, Systemstep: 5.27389e-05, Dloga: 7.02657e-05, Nsync-grv: 31415, Nsync-hyd: 0
> > ACCEL: Start tree gravity force computation... (31415 particles)
> > TREE: Full tree construction for all particles. (presently allocated=7626.51 MB)
> > GRAVTREE: Tree construction done. took 13.4471 sec <numnodes>=206492 NTopnodes=115433 NTopleaves=101004 tree-build-scalability=0.441627
> > FMM: Begin tree force. timebin=13 (presently allocated=0.5 MB)
> > Code termination on task=0, function gravity_fmm(), file src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > Code termination on task=887, function gravity_fmm(), file src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > Code termination on task=40, function gravity_fmm(), file src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > Code termination on task=888, function gravity_fmm(), file src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > Code termination on task=889, function gravity_fmm(), file src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > Code termination on task=3, function gravity_fmm(), file src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > Code termination on task=890, function gravity_fmm(), file src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > Code termination on task=6, function gravity_fmm(), file src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > Code termination on task=891, function gravity_fmm(), file src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > Code termination on task=9, function gravity_fmm(), file src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > Code termination on task=892, function gravity_fmm(), file src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > Code termination on task=893, function gravity_fmm(), file src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > Code termination on task=894, function gravity_fmm(), file src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > Code termination on task=20, function gravity_fmm(), file src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > ======================================
> >
> > I don't think fine-tuning the value of TREE_NUM_BEFORE_NODESPLIT is a real solution.
> > I could try BITS_FOR_POSITIONS=128 by setting POSITIONS_IN_128BIT, but I am afraid the code would then not be able to start from the existing restart files.
> > Any suggestions?
> > Many thanks.
> >
> > Best,
> > Weiguang
> >
> > -------------------------------------------
> > https://weiguangcui.github.io/
> >
> > -----------------------------------------------------------
> >
> > If you wish to unsubscribe from this mailing, send mail to
> > minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
> > A web-archive of this mailing list is available here:
> > http://www.mpa-garching.mpg.de/gadget/gadget-list