Re: Problems with treebuild -- setting the TREE_NUM_BEFORE_NODESPLIT

From: Weiguang Cui <cuiweiguang_at_gmail.com>
Date: Thu, 9 Sep 2021 16:26:46 +0100

Dear Volker,

For the MPI_Sendrecv problem in SUBFIND, I think it happens in the function
`SubDomain->particle_exchange_based_on_PS(SubComm);` -- line 101 in
subfind_processing.cc. After replacing "MPI_Sendrecv" with
"myMPI_Sendrecv" within this function in `domain_exchange.cc`, the code no
longer reports the MPI_Sendrecv error.

However, the original SUBFIND problem shows up again:
```
Code termination on task=2, function treebuild_insert_group_of_points(),
file src/tree/tree.cc, line 489: It appears we have reached the bottom of
the tree because there are more than TREE_NUM_BEFORE_NODESPLIT=96 particles
in the smallest tree node representable for BITS_FOR_POSITIONS=64.
Either eliminate the particles at (nearly) indentical coordinates, increase
the setting for TREE_NUM_BEFORE_NODESPLIT, or possibly enlarge
BITS_FOR_POSITIONS if you have really not enough dynamic range
```
As you can see, I have already increased TREE_NUM_BEFORE_NODESPLIT to 96.
Increasing this value to 128 requires an enlargement of MaxOnFetchStack in
fmm.cc, which caused memory problems. Here is my current setting:
`MaxOnFetchStack = std::max<int>(50 * (Tp->NumPart + NumPartImported), 9 *
TREE_MIN_WORKSTACK_SIZE);`
If you suggest an even larger value, I can only restart from a snapshot and
change the number of nodes.
By the way, the code runs a little slowly with a large value of
TREE_MIN_WORKSTACK_SIZE.

I have rsynced the recent run's log, slurm.3774931.out, to m200n2048-dm/ for
your reference.

Thank you for the comment on the lightcone thickness; my mistake, I failed to
notice the unit. I hope it is not connected to the SUBFIND problem; I have
increased the value to 2 Mpc/h.

Thank you for your help!

Best,
Weiguang

-------------------------------------------
https://weiguangcui.github.io/


On Wed, Sep 8, 2021 at 8:33 PM Volker Springel <
vspringel_at_mpa-garching.mpg.de> wrote:

>
> Dear Weiguang,
>
> Sorry for my sluggish answer. Too many other things on my plate.
>
> I think the crash you experienced in an MPI_Sendrecv call in SUBFIND
> most likely happens in line 270 of the file
> src/subfind/subfind_distribute.cc, because the call there is not yet
> protected against transfer sizes that exceed 2 GB in total... For the
> particle number and setup you're using, you actually have a particle
> storage of ~1.52 GB or so on average. With a memory imbalance of ~30%
> (which you just about reach according to your log file), it is possible
> that you hit the 2 GB at this place, causing the native MPI_Sendrecv call
> to fail.
>
> If this is indeed the problem, then replacing "MPI_Sendrecv" with
> "myMPI_Sendrecv" in lines 270 to 274 should fix it. I have also made this
> change in the code repository.
>
>
> Thanks for letting me know that you had to change the default size of
> TREE_MIN_WORKSTACK_SIZE to get around the bookkeeping buffer problem you
> experienced in fmm.cc. I guess I need to think about how this setting can
> be adjusted automatically so that it works in conditions like the one you
> created in your run.
>
> Best,
> Volker
>
>
>
>
> > On 2. Sep 2021, at 11:23, Weiguang Cui <cuiweiguang_at_gmail.com> wrote:
> >
> > Hi Volker,
> >
> > Did you find some time to look at the problem? I would like to finish
> this run as soon as possible, so I further modified the code (see the
> Gadget4 folder for the changes with git diff). The FMM factor is
> increased to 50:
> > - MaxOnFetchStack = std::max<int>(0.1 * (Tp->NumPart +
> NumPartImported), TREE_MIN_WORKSTACK_SIZE);
> > + MaxOnFetchStack = std::max<int>(50 * (Tp->NumPart + NumPartImported),
> 10 * TREE_MIN_WORKSTACK_SIZE);
> > and TREE_MIN_WORKSTACK_SIZE in gravtree.h is also increased:
> > -#define TREE_MIN_WORKSTACK_SIZE 100000
> > +#define TREE_MIN_WORKSTACK_SIZE 400000
> >
> > With these modifications, the code no longer shows the `Can't even
> process a single particle` problem in fmm, but it crashed with an
> MPI_Sendrecv problem in subfind. See the job slurm.3757634 for details.
> Maybe this is connected with the previous SUBFIND tree-construction
> problem, i.e. just too many particles in the halo?
> > If there is no easy fix, I will probably exclude the SUBFIND part to
> finish the run, which is a pity, as the full merger tree would then need
> to be redone.
> >
> > Thank you.
> >
> > Best,
> > Weiguang
> >
> > -------------------------------------------
> > https://weiguangcui.github.io/
> >
> >
> > On Sun, Aug 29, 2021 at 5:57 PM Volker Springel <
> vspringel_at_mpa-garching.mpg.de> wrote:
> >
> > Hi Weiguang,
> >
> > The tree construction problem in subfind is odd and still bothers me.
> Could you perhaps make the run available to me on cosma7 so that I can
> investigate this myself?
> >
> > I agree that there should be enough total memory for FMM, but the
> termination of the code appears to be caused by an insufficiently sized
> allocation of internal bookkeeping buffers related to the communication
> parts of the algorithm. If you can, you could also make this setup
> available to me; then I can take a look at why this happens.
> >
> > Regards,
> > Volker
> >
> > > On 24. Aug 2021, at 12:29, Weiguang Cui <cuiweiguang_at_gmail.com> wrote:
> > >
> > > Hi Volker,
> > >
> > > This is a pure dark-matter run. The problem happened when the
> simulation reached z~0.3.
> > > As you can see from the attached config options, this simulation used
> an old IC file, and double-precision output is not enabled either.
> > >
> > > I increased the factor from 0.1 to 0.5, which still resulted in the
> same error in fmm.cc. I don't think memory is an issue here. As shown
> in memory.txt, the maximum occupied memory (in the whole file) is
> > > ```MEMORY: Largest Allocation = 11263.9 Mbyte | Largest Allocation
> Without Generic = 11263.9 Mbyte``` and the parameter ```MaxMemSize
> 18000 % in MByte``` is in agreement with the machine's memory
> (cosma7). I will increase the factor to an even higher value to see if that
> works.
> > >
> > > If single-precision positions are not the issue, could it be caused
> by `FoFGravTree.treebuild(num, d);` or
> `FoFGravTree.treebuild(num_removed, dremoved);` in subfind_unbind, in
> which an FoF group has too many particles in too small a volume to build
> the tree?
> > >
> > > Any suggestions are welcome. Many thanks!
> > >
> > > ==================================
> > > ALLOW_HDF5_COMPRESSION
> > > ASMTH=1.2
> > > DOUBLEPRECISION=1
> > > DOUBLEPRECISION_FFTW
> > > FMM
> > > FOF
> > > FOF_GROUP_MIN_LEN=32
> > > FOF_LINKLENGTH=0.2
> > > FOF_PRIMARY_LINK_TYPES=2
> > > FOF_SECONDARY_LINK_TYPES=1+16+32
> > > GADGET2_HEADER
> > > IDS_64BIT
> > > LIGHTCONE
> > > LIGHTCONE_IMAGE_COMP_HSML_VELDISP
> > > LIGHTCONE_MASSMAPS
> > > LIGHTCONE_PARTICLES
> > > LIGHTCONE_PARTICLES_GROUPS
> > > MERGERTREE
> > > MULTIPOLE_ORDER=3
> > > NTAB=128
> > > NTYPES=6
> > > PERIODIC
> > > PMGRID=4096
> > > RANDOMIZE_DOMAINCENTER
> > > RCUT=4.5
> > > SELFGRAVITY
> > > SUBFIND
> > > SUBFIND_HBT
> > > TREE_NUM_BEFORE_NODESPLIT=64
> > > ===========================================================
> > >
> > >
> > > Best,
> > > Weiguang
> > >
> > > -------------------------------------------
> > > https://weiguangcui.github.io/
> > >
> > >
> > > On Mon, Aug 23, 2021 at 1:49 PM Volker Springel <
> vspringel_at_mpa-garching.mpg.de> wrote:
> > >
> > > Hi Weiguang,
> > >
> > > The code termination you experienced in the tree construction during
> subfind is quite puzzling to me, especially since you used
> BITS_FOR_POSITIONS=64... In principle, this situation should only arise if
> you have a small group of particles (~16) in a region about 10^18 times
> smaller than the boxsize. Has this situation occurred during a simulation
> run, or in postprocessing? If you have used single precision for storing
> positions in a snapshot file, or if you have dense blobs of gas with
> intense star formation, then you can get occasional coordinate collisions
> of two or several particles, but ~16 seems exceedingly unlikely. So I'm
> not sure what's really going on here. Have things actually worked when
> setting TREE_NUM_BEFORE_NODESPLIT=64?
> > >
> > > The issue in FMM is a memory issue. It should be possible to resolve
> it with a higher setting of MaxMemSize, or by enlarging the factor 0.1 in
> line 1745 of fmm.cc,
> > > MaxOnFetchStack = std::max<int>(0.1 * (Tp->NumPart + NumPartImported),
> TREE_MIN_WORKSTACK_SIZE);
> > >
> > > Best,
> > > Volker
> > >
> > >
> > > > On 21. Aug 2021, at 10:10, Weiguang Cui <cuiweiguang_at_gmail.com>
> wrote:
> > > >
> > > > Dear all,
> > > >
> > > > I recently met another problem with the 2048^3, 200 mpc/h run.
> > > >
> > > > treebuild in SUBFIND requires a higher value for
> TREE_NUM_BEFORE_NODESPLIT:
> > > > ==========================================================
> > > > SUBFIND: We now execute a parallel version of SUBFIND.
> > > > SUBFIND: Previous subhalo catalogue had approximately a size
> 2.42768e+09, and the summed squared subhalo size was 8.42698e+16
> > > > SUBFIND: Number of FOF halos treated with collective SubFind
> algorithm = 1
> > > > SUBFIND: Number of processors used in different partitions for the
> collective SubFind code = 2
> > > > SUBFIND: (The adopted size-limit for the collective algorithm was
> 9631634 particles, for threshold size factor 0.6)
> > > > SUBFIND: The other 10021349 FOF halos are treated in parallel with
> serial code
> > > > SUBFIND: subfind_distribute_groups() took 0.044379 sec
> > > > SUBFIND: particle balance=1.10537
> > > > SUBFIND: subfind_exchange() took 30.2562 sec
> > > > SUBFIND: particle balance for processing=1
> > > > SUBFIND: root-task=0: Collectively doing halo 0 of length 10426033
> on 2 processors.
> > > > SUBFIND: subdomain decomposition took 8.54527 sec
> > > > SUBFIND: serial subfind subdomain decomposition took 6.0162 sec
> > > > SUBFIND: root-task=0: total number of subhalo coll_candidates=1454
> > > > SUBFIND: root-task=0: number of subhalo candidates small enough to
> be done with one cpu: 1453. (Largest size 81455)
> > > > Code termination on task=0, function
> treebuild_insert_group_of_points(), file src/tree/tree.cc, line 489: It
> appears we have reached the bottom of the tree because there are more than
> TREE_NUM_BEFORE_NODESPLIT=16 particles in the smallest tree node
> representable for BITS_FOR_POSITIONS=64.
> > > > Either eliminate the particles at (nearly) indentical coordinates,
> increase the setting for TREE_NUM_BEFORE_NODESPLIT, or possibly enlarge
> BITS_FOR_POSITIONS if you have really not enough dynamic range
> > > > ==============================================
> > > >
> > > > But if I increase TREE_NUM_BEFORE_NODESPLIT to 64, FMM seems not
> to work:
> > > > =============================================================
> > > > Sync-Point 19835, Time: 0.750591, Redshift: 0.332284, Systemstep:
> 5.27389e-05, Dloga: 7.02657e-05, Nsync-grv: 31415, Nsync-hyd:
> 0
> > > > ACCEL: Start tree gravity force computation... (31415 particles)
> > > > TREE: Full tree construction for all particles. (presently
> allocated=7626.51 MB)
> > > > GRAVTREE: Tree construction done. took 13.4471 sec
> <numnodes>=206492 NTopnodes=115433 NTopleaves=101004
> tree-build-scalability=0.441627
> > > > FMM: Begin tree force. timebin=13 (presently allocated=0.5 MB)
> > > > Code termination on task=0, function gravity_fmm(), file
> src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > > > Code termination on task=887, function gravity_fmm(), file
> src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > > > Code termination on task=40, function gravity_fmm(), file
> src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > > > Code termination on task=888, function gravity_fmm(), file
> src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > > > Code termination on task=889, function gravity_fmm(), file
> src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > > > Code termination on task=3, function gravity_fmm(), file
> src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > > > Code termination on task=890, function gravity_fmm(), file
> src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > > > Code termination on task=6, function gravity_fmm(), file
> src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > > > Code termination on task=891, function gravity_fmm(), file
> src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > > > Code termination on task=9, function gravity_fmm(), file
> src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > > > Code termination on task=892, function gravity_fmm(), file
> src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > > > Code termination on task=893, function gravity_fmm(), file
> src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > > > Code termination on task=894, function gravity_fmm(), file
> src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > > > Code termination on task=20, function gravity_fmm(), file
> src/fmm/fmm.cc, line 1879: Can't even process a single particle
> > > > ======================================
> > > >
> > > > I don't think fine-tuning the value of TREE_NUM_BEFORE_NODESPLIT is
> a solution.
> > > > I could try BITS_FOR_POSITIONS=128 by setting
> POSITIONS_IN_128BIT, but I am afraid that the code might then not be able
> to run from the restart files.
> > > > Any suggestions?
> > > > Many thanks.
> > > >
> > > > Best,
> > > > Weiguang
> > > >
> > > > -------------------------------------------
> > > > https://weiguangcui.github.io/
> > > >
> > > > -----------------------------------------------------------
> > > >
> > > > If you wish to unsubscribe from this mailing, send mail to
> > > > minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe
> gadget-list
> > > > A web-archive of this mailing list is available here:
> > > > http://www.mpa-garching.mpg.de/gadget/gadget-list
> > >
> > >
> > >
> > >
> >
> >
> >
> >
>
>
>
>
>
Received on 2021-09-09 17:27:34

This archive was generated by hypermail 2.3.0 : 2022-09-01 14:03:43 CEST