Re: [Gadget 4] possibly a bug in pm_nonperiodic.cc after snapshot was saved

From: Weiguang Cui <cuiweiguang_at_gmail.com>
Date: Tue, 11 May 2021 10:07:45 +0100

Hi Volker,

Sorry for the late confirmation. Yes, I had misunderstood the
TREEPM_NOTIMESPLIT option. With it enabled, the code runs smoothly.
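
For anyone hitting the same problem, the rule from Volker's reply (PMGRID together with HIERARCHICAL_GRAVITY also needs TREEPM_NOTIMESPLIT) can be checked before compiling. The following is only a minimal sketch, not part of Gadget4: it assumes a Config.sh-style file with one option per line and `#` comments, and the function names and the mesh size are hypothetical.

```python
def enabled_options(config_text):
    """Return the set of option names enabled in a Config.sh-style text."""
    options = set()
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            # 'PMGRID=512' -> 'PMGRID'; plain flags are kept as they are
            options.add(line.split("=", 1)[0].strip())
    return options


def timesplit_conflict(config_text):
    """True if PMGRID and HIERARCHICAL_GRAVITY are set without TREEPM_NOTIMESPLIT."""
    opts = enabled_options(config_text)
    return {"PMGRID", "HIERARCHICAL_GRAVITY"} <= opts and "TREEPM_NOTIMESPLIT" not in opts


# Hypothetical configuration excerpt (the mesh size is an assumption).
example = """
PMGRID=512            # hypothetical mesh size
HIERARCHICAL_GRAVITY
#TREEPM_NOTIMESPLIT   # commented out: this is the problematic combination
"""
print(timesplit_conflict(example))
```

Uncommenting TREEPM_NOTIMESPLIT (or removing HIERARCHICAL_GRAVITY) in the excerpt makes the check return False.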

Increasing TopNodeFactor from ~5 to 9 indeed helps with the
treeimbalance, bringing it down to ~30%, which is still a little
higher than expected. I will explore other options.
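
The imbalance share can be read off cpu.txt directly. As a rough sketch (not a Gadget4 tool), the snippet below parses a step block and computes the cumulative treeimbalance fraction. It assumes each entry line has the five fields name, diff, diff%, cumulative, cumulative%, and it uses a few lines from the cpu.txt excerpt quoted further down in this thread; the function name is hypothetical.

```python
def cumulative_seconds(cpu_block):
    """Map each entry name in a cpu.txt step block to its cumulative seconds."""
    totals = {}
    for line in cpu_block.splitlines():
        fields = line.split()
        # Assumed layout: name, diff, diff%, cumulative, cumulative%
        if len(fields) == 5 and fields[2].endswith("%") and fields[4].endswith("%"):
            totals[fields[0]] = float(fields[3])
    return totals


# Lines copied from the step-1925 block quoted below in this thread.
block = """\
total 0.01 100.0% 15689.45 100.0%
treegrav 0.01 72.6% 12704.52 81.0%
treeforce 0.00 0.6% 12362.98 78.8%
treewalk 0.00 0.0% 4490.18 28.6%
treeimbalance 0.00 0.2% 7867.96 50.1%
"""
totals = cumulative_seconds(block)
imbalance_share = totals["treeimbalance"] / totals["total"]
print(f"treeimbalance: {imbalance_share:.1%} of total CPU time")
```

Running the same calculation on the cpu.txt of a run with a different TopNodeFactor gives a like-for-like comparison of the cumulative imbalance.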

One last thing: I set the simulation to run into the future. It ran fine
but failed while saving the last snapshot:
```
Final time=1.5 reached. Simulation ends.

SNAPSHOT: writing snapshot file #128 @ time 1.5 ...
SNAPSHOT: writing snapshot file: './snapshot_128' (file 1 of 1)
SNAPSHOT: writing snapshot block 0 (Coordinates)...
SNAPSHOT: writing snapshot block 1 (Velocities)...
SNAPSHOT: writing snapshot block 2 (ParticleIDs)...
SNAPSHOT: writing snapshot block 3 (Masses)...
SNAPSHOT: writing snapshot block 7 (SubfindDensity)...
[miclap:457496] *** Process received signal ***
[miclap:457496] Signal: Segmentation fault (11)
[miclap:457496] Signal code: Address not mapped (1)
[miclap:457496] Failing at address: 0x38
[miclap:457496] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fc6e1d7f3c0]
[miclap:457496] [ 1] Gadget4(+0x44540)[0x564037866540]
[miclap:457496] [ 2] Gadget4(+0x4b1f0)[0x56403786d1f0]
[miclap:457496] [ 3] Gadget4(+0x4ccf9)[0x56403786ecf9]
[miclap:457496] [ 4] Gadget4(+0x3d231)[0x56403785f231]
[miclap:457496] [ 5] Gadget4(+0x265c3)[0x5640378485c3]
[miclap:457496] [ 6] Gadget4(+0x12f32)[0x564037834f32]
[miclap:457496] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fc6e1b9f0b3]
[miclap:457496] [ 8] Gadget4(+0x14ace)[0x564037836ace]
[miclap:457496] *** End of error message ***
```
Restarting the run from the restart files didn't help, but this could
still be a problem with the server. Please let me know if you think
otherwise and would like me to open this issue in a separate thread.
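
The backtrace above only shows raw offsets into the Gadget4 binary. One way to turn them into function names is binutils' `addr2line`, which is only informative if the executable was built with debug symbols. The sketch below is a hypothetical helper, not part of Gadget4: it extracts the offsets from such a trace and prints the matching commands. The path `./Gadget4` is an assumption.

```python
import re

# Matches e.g. "[ 1] Gadget4(+0x44540)" and captures frame number, object name, offset.
FRAME = re.compile(r"\[\s*(\d+)\]\s+(\S+?)\(\+0x([0-9a-fA-F]+)\)")


def gadget_frames(trace, binary="Gadget4"):
    """Return (frame number, offset) pairs for the frames that lie inside `binary`."""
    return [(int(n), "0x" + off) for n, name, off in FRAME.findall(trace) if name == binary]


# First three Gadget4 frames from the trace above.
trace = """\
[miclap:457496] [ 1] Gadget4(+0x44540)[0x564037866540]
[miclap:457496] [ 2] Gadget4(+0x4b1f0)[0x56403786d1f0]
[miclap:457496] [ 3] Gadget4(+0x4ccf9)[0x56403786ecf9]
"""
frames = gadget_frames(trace)
for number, offset in frames:
    # -f prints function names, -C demangles C++ symbols, -e selects the executable
    print(f"frame {number}: addr2line -f -C -e ./Gadget4 {offset}")
```

Frames from shared libraries (libpthread, libc) are skipped because their offsets refer to other files.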

Thank you very much.

Best,
Weiguang

-------------------------------------------
https://weiguangcui.github.io/


On Thu, May 6, 2021 at 1:40 PM Volker Springel <vspringel_at_mpa-garching.mpg.de> wrote:

>
> Hi Weiguang,
>
> After thinking about this issue, I guess you have most likely enabled
> PMGRID and HIERARCHICAL_GRAVITY, but
> not TREEPM_NOTIMESPLIT... The code has then been able to run ok (this is
> why it worked for you at higher redshift), until you hit a situation where
> not all particles were put onto a single timestep anymore.
>
> The reason for this is that using the hierarchical time integration is
> presently not compatible with imposing an additional split of the
> Hamiltonian on some spatial scale set by the (high-res) PM-mesh, which is
> what Gadget2 does by default. If you want to use HIERARCHICAL_GRAVITY
> together with a PM mesh, you therefore need to enable in addition the
> option TREEPM_NOTIMESPLIT... The code wasn't yet checking whether a user is
> actually doing this - but I have now added such a test, and the code will
> now complain when one attempts this.
>
> The background of this issue is explained in subsection 4.3 of the Gadget4
> code paper. So you either have to disable HIERARCHICAL_GRAVITY, or enable
> TREEPM_NOTIMESPLIT. Especially in the latter case, whether or not it makes
> sense to use a high-res mesh is much less clear (and particularly for FMM
> it is often a significant disadvantage).
>
> Yes, your run seems to suffer from significant work-load imbalance losses.
> It's hard to say without studying the full logs and your particular
> simulation setup why this is. You can take a look at what the domain.txt file
> says - this gives information about the level of work-load balance the code
> thinks it should achieve based on the domain decomposition. If this is
> already looking bad, then increasing TopNodeFactor should in principle
> help.
>
> Regards,
> Volker
>
> > On 4. May 2021, at 16:03, Weiguang Cui <cuiweiguang_at_gmail.com> wrote:
> >
> > Hi Volker and Gadget-helpers,
> >
> > I am running a zoomed-in test with Gadget4. It worked fine when saving
> > previous snapshots, but it reported an error after saving this snapshot. I
> > think there may be some misconnection with the PM part of the force
> > calculation. Briefly checking the log after earlier snapshot writes, I
> > only found force calculations with FMM following them. The detailed
> > error report follows:
> >
> > ```
> > SNAPSHOT: writing snapshot block 9 (SubfindVelDisp)...
> > SNAPSHOT: done with writing snapshot. Took 8.2192 sec, total size 876.136 MB, corresponds to effective I/O rate of 106.596 MB/sec
> >
> > SNAPSHOT: writing snapshot file #39 @ time 0.139043 ...
> > SNAPSHOT: writing snapshot file: './snapshot-prevmostboundonly_039' (file 1 of 1)
> > SNAPSHOT: writing snapshot rename './snapshot-prevmostboundonly_039.hdf5' to './bak-snapshot-prevmostboundonly_039.hdf5'
> > SNAPSHOT: writing snapshot block 0 (Coordinates)...
> > SNAPSHOT: writing snapshot block 1 (Velocities)...
> > SNAPSHOT: writing snapshot block 2 (ParticleIDs)...
> > SNAPSHOT: writing snapshot block 7 (SubfindDensity)...
> > SNAPSHOT: writing snapshot block 8 (SubfindHsml)...
> > SNAPSHOT: writing snapshot block 9 (SubfindVelDisp)...
> > SNAPSHOT: done with writing snapshot. Took 0.0642038 sec, total size 0.201492 MB, corresponds to effective I/O rate of 3.13832 MB/sec
> >
> > SNAPSHOT: Setting next time for snapshot file to Time_next= 0.142332 (DumpFlag=1)
> >
> > KICKS: 1st gravity for hierarchical timebin=20: 21573294 particles dt_gravkick=0.0313998 0.0313998 0.0313998
> > KICKS: 1st gravity for hierarchical timebin=19: 21573294 particles dt_gravkick=-0.015704 0.0156958 0.0156958
> > ACCEL: Start tree gravity force computation... (1111209 particles)
> > TREEPM: Starting PM part of force calculation. (timebin=18)
> >
> > --------------------------------------------------------------------------
> > MPI_ABORT was invoked on rank 100 in communicator MPI_COMM_WORLD
> > with errorcode 1.
> >
> > NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> > You may or may not see output from other processes, depending on
> > exactly when Open MPI kills them.
> >
> > --------------------------------------------------------------------------
> > Code termination on task=96, function pmforce_nonperiodic(), file src/pm/pm_nonperiodic.cc, line 1472: unexpected NSource != Sp->NumPart
> > Code termination on task=97, function pmforce_nonperiodic(), file src/pm/pm_nonperiodic.cc, line 1472: unexpected NSource != Sp->NumPart
> >
> > ```
> >
> > Please let me know if you need other information about the test.
> >
> >
> > Another, unrelated question: I think some of my test/config parameters
> > are not set properly. As you can see from the last time step in the
> > cpu.txt file, treeimbalance takes up half of the CPU time. Do you have any
> > suggestions on how to improve this?
> > ```
> > Step 1925, Time: 0.139002, CPUs: 128, HighestActiveTimeBin: 15
> >                     diff         cumulative
> > total               0.01  100.0%   15689.45  100.0%
> > treegrav            0.01   72.6%   12704.52   81.0%
> > treebuild           0.01   52.7%     324.26    2.1%
> > insert              0.00   18.8%     224.70    1.4%
> > branches            0.00    0.2%       9.11    0.1%
> > toplevel            0.00   31.1%      28.95    0.2%
> > treeforce           0.00    0.6%   12362.98   78.8%
> > treewalk            0.00    0.0%    4490.18   28.6%
> > treeimbalance       0.00    0.2%    7867.96   50.1%
> > treefetch           0.00    0.0%       0.05    0.0%
> > treestack           0.00    0.3%       4.80    0.0%
> > pm_grav             0.00    0.0%    1915.98   12.2%
> > ngbtreevelupdate    0.00    0.1%       0.07    0.0%
> > ngbtreehsmlupdate   0.00    0.3%       0.10    0.0%
> > sph                 0.00    0.0%       0.00    0.0%
> > density             0.00    0.0%       0.00    0.0%
> > densitywalk         0.00    0.0%       0.00    0.0%
> > densityfetch        0.00    0.0%       0.00    0.0%
> > densimbalance       0.00    0.0%       0.00    0.0%
> > hydro               0.00    0.0%       0.00    0.0%
> > hydrowalk           0.00    0.0%       0.00    0.0%
> > hydrofetch          0.00    0.0%       0.00    0.0%
> > hydroimbalance      0.00    0.0%       0.00    0.0%
> > domain              0.00    0.0%     275.44    1.8%
> > peano               0.00    0.0%      44.75    0.3%
> > drift/kicks         0.00    3.2%     181.57    1.2%
> > timeline            0.00    0.0%       4.13    0.0%
> > treetimesteps       0.00    0.0%       0.00    0.0%
> > i/o                 0.00    0.0%     400.66    2.6%
> > logs                0.00   20.1%      31.70    0.2%
> > fof                 0.00    0.0%      39.95    0.3%
> > fofwalk             0.00    0.0%       2.20    0.0%
> > fofimbal            0.00    0.0%       2.91    0.0%
> > subfind             0.00    0.0%      20.92    0.1%
> > restart             0.00    0.0%      10.89    0.1%
> > misc                0.00    3.7%      58.77    0.4%
> >
> > Many thanks.
> >
> > Best,
> > Weiguang
> >
> > -------------------------------------------
> > https://weiguangcui.github.io/
> >
> > -----------------------------------------------------------
> >
> > If you wish to unsubscribe from this mailing, send mail to
> > minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe
> gadget-list
> > A web-archive of this mailing list is available here:
> > http://www.mpa-garching.mpg.de/gadget/gadget-list
Received on 2021-05-11 11:08:33
