Re: [Gadget 4] possibly a bug in pm_nonperiodic.cc after snapshot was saved

From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Date: Thu, 6 May 2021 14:39:11 +0200

Hi Weiguang,

After thinking about this issue, I guess you have most likely enabled PMGRID and HIERARCHICAL_GRAVITY, but not TREEPM_NOTIMESPLIT... The code was then able to run OK (this is why it worked for you at higher redshift) until you hit a situation where not all particles were put onto a single timestep anymore.

The reason for this is that the hierarchical time integration is presently not compatible with imposing an additional split of the Hamiltonian on some spatial scale set by the (high-res) PM-mesh, which is what Gadget2 does by default. If you want to use HIERARCHICAL_GRAVITY together with a PM mesh, you therefore need to enable the option TREEPM_NOTIMESPLIT in addition... The code wasn't yet checking whether a user is actually doing this, but I have now added such a test, and the code will now complain when one attempts this.

The background of this issue is explained in subsection 4.3 of the Gadget4 code paper. So you either have to disable HIERARCHICAL_GRAVITY or enable TREEPM_NOTIMESPLIT. In the latter case especially, it is much less clear whether using a high-res mesh makes sense at all (and particularly for FMM it is often a significant disadvantage).
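To make this concrete, the two consistent setups would look roughly as follows in Config.sh (just a sketch - the mesh size 512 is only a placeholder, and whatever other options your run needs are omitted):

```
# Option A: hierarchical time integration together with a PM mesh
# (needs the extra option, as explained above)
HIERARCHICAL_GRAVITY
PMGRID=512              # placeholder mesh size, not a recommendation
TREEPM_NOTIMESPLIT

# Option B: keep the usual time-split TreePM scheme instead
PMGRID=512              # placeholder mesh size
# HIERARCHICAL_GRAVITY  # left disabled
```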
 
Yes, your run seems to suffer from significant work-load imbalance losses. It's hard to say why this is without studying the full logs and your particular simulation setup. You can take a look at what the domain.txt file says - it gives information about the level of work-load balance the code thinks it should be able to achieve based on the domain decomposition. If this already looks bad, then increasing TopNodeFactor should in principle help.
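For completeness, the knob meant here is the TopNodeFactor entry in the parameterfile, along these lines (just a sketch - the value 5.0 is only an example to experiment with, not a recommendation):

```
% larger values give the domain decomposition more, smaller top-level
% nodes to balance the work with, at the cost of some extra overhead
TopNodeFactor    5.0     % example: try raising it from your current value
```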

Regards,
Volker

> On 4. May 2021, at 16:03, Weiguang Cui <cuiweiguang_at_gmail.com> wrote:
>
> Hi Volker and Gadget-helpers,
>
> I am running a zoomed-in test with Gadget4. It worked fine for saving previous snapshots, but it reported an error after saving this snapshot. I think there may be some disconnect in the PM part of the force calculation. From a brief check of the previous snapshot writes, I only found force calculations done with FMM. The detailed error report follows:
>
> ```
> SNAPSHOT: writing snapshot block 9 (SubfindVelDisp)...
> SNAPSHOT: done with writing snapshot. Took 8.2192 sec, total size 876.136 MB, corresponds to effective I/O rate of 106.596 MB/sec
>
> SNAPSHOT: writing snapshot file #39 @ time 0.139043 ...
> SNAPSHOT: writing snapshot file: './snapshot-prevmostboundonly_039' (file 1 of 1)
> SNAPSHOT: writing snapshot rename './snapshot-prevmostboundonly_039.hdf5' to './bak-snapshot-prevmostboundonly_039.hdf5'
> SNAPSHOT: writing snapshot block 0 (Coordinates)...
> SNAPSHOT: writing snapshot block 1 (Velocities)...
> SNAPSHOT: writing snapshot block 2 (ParticleIDs)...
> SNAPSHOT: writing snapshot block 7 (SubfindDensity)...
> SNAPSHOT: writing snapshot block 8 (SubfindHsml)...
> SNAPSHOT: writing snapshot block 9 (SubfindVelDisp)...
> SNAPSHOT: done with writing snapshot. Took 0.0642038 sec, total size 0.201492 MB, corresponds to effective I/O rate of 3.13832 MB/sec
>
> SNAPSHOT: Setting next time for snapshot file to Time_next= 0.142332 (DumpFlag=1)
>
> KICKS: 1st gravity for hierarchical timebin=20: 21573294 particles dt_gravkick=0.0313998 0.0313998 0.0313998
> KICKS: 1st gravity for hierarchical timebin=19: 21573294 particles dt_gravkick=-0.015704 0.0156958 0.0156958
> ACCEL: Start tree gravity force computation... (1111209 particles)
> TREEPM: Starting PM part of force calculation. (timebin=18)
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 100 in communicator MPI_COMM_WORLD
> with errorcode 1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> Code termination on task=96, function pmforce_nonperiodic(), file src/pm/pm_nonperiodic.cc, line 1472: unexpected NSource != Sp->NumPart
> Code termination on task=97, function pmforce_nonperiodic(), file src/pm/pm_nonperiodic.cc, line 1472: unexpected NSource != Sp->NumPart
>
> ```
>
> Please let me know if you need other information about the test.
>
>
> Another, unrelated question: I think some of my test/config parameters are not set properly. As you can see from the last time step in the cpu.txt file, treeimbalance occupies half of the CPU time. Do you have any suggestions on how to improve this?
> ```
> Step 1925, Time: 0.139002, CPUs: 128, HighestActiveTimeBin: 15
>                          diff          cumulative
> total                 0.01  100.0%    15689.45  100.0%
> treegrav              0.01   72.6%    12704.52   81.0%
> treebuild             0.01   52.7%      324.26    2.1%
> insert                0.00   18.8%      224.70    1.4%
> branches              0.00    0.2%        9.11    0.1%
> toplevel              0.00   31.1%       28.95    0.2%
> treeforce             0.00    0.6%    12362.98   78.8%
> treewalk              0.00    0.0%     4490.18   28.6%
> treeimbalance         0.00    0.2%     7867.96   50.1%
> treefetch             0.00    0.0%        0.05    0.0%
> treestack             0.00    0.3%        4.80    0.0%
> pm_grav               0.00    0.0%     1915.98   12.2%
> ngbtreevelupdate      0.00    0.1%        0.07    0.0%
> ngbtreehsmlupdate     0.00    0.3%        0.10    0.0%
> sph                   0.00    0.0%        0.00    0.0%
> density               0.00    0.0%        0.00    0.0%
> densitywalk           0.00    0.0%        0.00    0.0%
> densityfetch          0.00    0.0%        0.00    0.0%
> densimbalance         0.00    0.0%        0.00    0.0%
> hydro                 0.00    0.0%        0.00    0.0%
> hydrowalk             0.00    0.0%        0.00    0.0%
> hydrofetch            0.00    0.0%        0.00    0.0%
> hydroimbalance        0.00    0.0%        0.00    0.0%
> domain                0.00    0.0%      275.44    1.8%
> peano                 0.00    0.0%       44.75    0.3%
> drift/kicks           0.00    3.2%      181.57    1.2%
> timeline              0.00    0.0%        4.13    0.0%
> treetimesteps         0.00    0.0%        0.00    0.0%
> i/o                   0.00    0.0%      400.66    2.6%
> logs                  0.00   20.1%       31.70    0.2%
> fof                   0.00    0.0%       39.95    0.3%
> fofwalk               0.00    0.0%        2.20    0.0%
> fofimbal              0.00    0.0%        2.91    0.0%
> subfind               0.00    0.0%       20.92    0.1%
> restart               0.00    0.0%       10.89    0.1%
> misc                  0.00    3.7%       58.77    0.4%
> ```
>
> Many thanks.
>
> Best,
> Weiguang
>
> -------------------------------------------
> https://weiguangcui.github.io/
>
> -----------------------------------------------------------
>
> If you wish to unsubscribe from this mailing, send mail to
> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
> A web-archive of this mailing list is available here:
> http://www.mpa-garching.mpg.de/gadget/gadget-list
Received on 2021-05-06 14:39:11
