Re: Error from function deal_with_sph_node_request

From: Goddard, Julianne <Julianne.Goddard_at_uky.edu>
Date: Wed, 8 Dec 2021 01:52:51 +0000

Hello Volker,

I am sorry for taking so long to respond. I have been taking many steps to debug and unfortunately am still having the same issue. I am now running on only two nodes and with the unedited version of Gadget4. I did try changing the way I allocate tasks, as you suggested, and consulted with one of our cluster administrators on this, but so far I have not seen a significant change in the healthtest results or any improvement in avoiding the deal_with_xxx_node_request() error.

I am truly grateful that you took the time to run my simulation, and I am curious whether it finished on your system without any errors while running in parallel. I suspect the issue has to do with my environment, since the error does not seem to occur when the code is run on other clusters.

Sincerely,

Julianne
________________________________
From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Sent: Sunday, November 7, 2021 12:46 PM
To: Gadget General Discussion <gadget-list_at_MPA-Garching.MPG.DE>
Subject: Re: [gadget-list] Error from function deal_with_sph_node_request

Hi Julianne,

I've tried your code and setup on our machine, using 2 nodes (40 cores/node) like you did, but so far I have had no luck in triggering the crash that you have seen.

In looking at your stdout files I noted something odd, however. Gadget reports the line
"Shared memory islands host a minimum of 44 and a maximum of 48 MPI ranks."
This means your MPI ranks are placed asymmetrically onto the two nodes you're using (44 on one node and 48 on the other, which adds up to the 92 tasks requested below). This is almost certainly unintended and is due to an incorrect slurm batch script. There you have used:

#SBATCH --nodes=2
#SBATCH --ntasks=92
mpirun -np 92 ./Gadget4 param.txt

This would only make sense if your compute nodes had 46 cores each... which, as far as I know, don't exist.

I would highly recommend launching multi-node jobs through something like

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
mpirun -np $SLURM_NPROCS ./Gadget4 param.txt

where the number 40 must be replaced by the number of physical cores you actually have on each compute node (assuming the nodes are allocated to you exclusively). If you don't know this number for sure, you can place a "cat /proc/cpuinfo" call into your job script and find out.
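
As a side note, a quick way to check this directly inside the job script, assuming the lscpu utility is available on your compute nodes, would be something along these lines:

lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket'   # physical cores = sockets x cores per socket
grep -c ^processor /proc/cpuinfo                     # logical CPUs, which may include hyperthreads

The second count includes hyperthreads where they are enabled, so it can exceed the number of physical cores.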

The latter way of launching your job will place the MPI ranks symmetrically onto your compute nodes, and by using the automatically calculated number $SLURM_NPROCS in mpirun you avoid accidental mistakes that overcommit cores with multiple MPI ranks. This appears to have happened in your case, judging by the dismal communication performance reported by the health test for the hypercube pattern in one of your log files. Such an overcommitment, combined with the asymmetry, may well have caused the instability you have seen, although this is not fully certain.
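
If you want to double-check the resulting rank placement, one simple option is to add something like the following to the job script before the mpirun line; it prints how many ranks end up on each node:

srun hostname | sort | uniq -c   # one line per node, prefixed by the number of ranks placed on it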

In any case, you should fix this start-up issue, and I'd also recommend enabling the healthtest by default. Please let me know whether you then still get the occasional crashes.

Best regards,
Volker



> On 3. Nov 2021, at 17:58, Goddard, Julianne <Julianne.Goddard_at_uky.edu> wrote:
>
> Hello Volker,
>
> Thank you for your time and advice. I have shared a Google Drive folder with your email address (vspringel_at_MPA-Garching.MPG.DE) containing everything from the Gadget directory I have been using. Please let me know if there is any problem with accessing the files or if I can provide additional information; I will be happy to share them in some other way or to add more to the folder.
>
> Thank You Again,
> Julianne
> From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
> Sent: Wednesday, November 3, 2021 4:13 AM
> To: Gadget General Discussion <gadget-list_at_MPA-Garching.MPG.DE>
> Subject: Re: [gadget-list] Error from function deal_with_sph_node_request
>
> Hi Julianne,
>
> Thanks for patiently doing all these suggested tests. While they haven't fixed the problem, they at least confirmed that the error appears related to the communication routines in deal_with_sph_node_request(), and apparently even deal_with_gravity_node_request(). This is somewhat unexpected on my end, as I have not seen this error myself so far - but that doesn't mean that everything is necessarily correct. The non-reproducibility and rarity of your crashes suggest that there could be a subtle race condition in the communication routines that has not yet been recognized and that, for some reason, occurs on your machine but not on the computers I had a chance to test on.
>
> It would be helpful if you could make the exact code you are running, the configuration+parameter files, the ICs, and a couple of stdout log files of the crashes available to me for download somewhere. I can then look into it somewhat more.
>
> Regards,
> Volker
>
>
> > On 2. Nov 2021, at 21:04, Goddard, Julianne <Julianne.Goddard_at_uky.edu> wrote:
> >
> > Dear Volker and Leonard and other list subscribers,
> >
> > Thank you very much for your time and suggestions, I am writing to update on the results of implementing these suggestions on my simulation.
> >
> > I first enabled PRESERVE_SHMEM_BINARY_INVARIANCE to test whether this creates a reproducible problem. I changed nothing else, ran my simulation twice, and got the same error both times; however, for the first run it occurred around z=11 and for the second around z=5 (in both cases the simulation was started at z=99). I have since left this setting active, but it does not seem to make a difference to the seemingly random occurrences of the error.
> >
> > I next set ActivePartFracForNewDomainDecomp=0 without any other changes and found that I still get the same error along with program termination (this time occurring around z=12). In this and all other cases, if I try to restart the simulation it will run a bit farther, but inevitably the same termination and error message occur at some later redshift.
> >
> > Finally, I tried changing the combination of compilers and MPI libraries. I have used intel18.0.3.222/impi2018.3.222, gnu8.3.0/mpich3.3, gnu8.3.0/openmpi3.1.4, and gnu8.3.0/openmpi4.1.1. With each of these the error looks different (for example, with OpenMPI 3 the error is reported as Segmentation Fault (11)), but if I analyze the core dump with gdb I can see that in every case the function the simulation was working on when it crashed was either deal_with_sph_node_request() or deal_with_gravity_node_request().
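> >
> > For reference, the backtrace can be obtained with something along these lines (assuming core dumps are enabled, e.g. via "ulimit -c unlimited" in the job script, and the core file is written next to the executable):
> >
> > gdb ./Gadget4 core.<pid>   # load the executable together with the core file
> > (gdb) bt                   # print the backtrace of the rank that crashed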
> >
> > I have found that if I run the simulation with an unaltered version of Gadget4 I still get the same error, so I do not think it has only to do with the Grackle cooling, though that could be contributing. And, as before, if I run on only one node there is no problem; the error only occurs when running in parallel.
> >
> > I am now working on getting output from just before the crash to see what is happening and will update again with those results.
> >
> > Volker, I think I have now tried all of the suggestions provided except for this final step of looking right at the moment of the crash. Do you have any additional suggestions or insights based on this new information, or on how best to look at the data right at the moment of the error?
> >
> > Thank you again for your time,
> > Julianne
> > From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
> > Sent: Monday, October 25, 2021 12:32 PM
> > To: Gadget General Discussion <gadget-list_at_MPA-Garching.MPG.DE>
> > Subject: Re: [gadget-list] Error from function deal_with_sph_node_request
> >
> > Hi Julianne and Leonard,
> >
> > I have no clear answer to this problem, as I have yet to run into it myself. Random occurrences of the issue in multi-node configurations only, in the SPH communication routines, suggest that it may have something to do with the way the spin-locks in the neighbor tree are handled by the code, through the calls access.test_and_set() and access.clear(). This could either be because of a semantic error in the way the code does this (hopefully not), or because the compiler is not correctly respecting all aspects of the (still fairly recent) C++ memory model for concurrency. Try using a different C++ compiler, and/or a more recent version, to test for the latter.
> >
> > Occasional MPI errors could in principle also play a role; I doubt that this is the cause here, but I would nevertheless suggest trying another MPI library as well (I recently had good experiences with OpenMPI, which tends to be quite stable in its 4.x versions).
> >
> > Other than that, as an experiment you can try to disable all forward predictions of SPH neighbor search nodes in the code (which also involve spin locks) by forcing a tree construction every step. The simplest (if costly) way to do this would be to set ActivePartFracForNewDomainDecomp=0, which enforces a new domain decomposition on every step. It would be interesting to know whether the problem is then still there or not.
> >
> > Finally, things could also be related to your own code extensions, for example in the way you implement star formation and/or depletion of gas. For example, modifying MaxPart inconsistently across processors while the neighbor tree is still in use would trigger crashes of the kind you've seen.
> >
> > To have a chance of making the problem reproducible, I'd recommend activating PRESERVE_SHMEM_BINARY_INVARIANCE and making sure that all your grackle and SFR routines are binary reproducible when the same random number sequence is ensured. If the problem still persists and is not reproducible, one needs to add more debugging output in case the crash situation (the controlled termination of the code in src/mpi_utils/shared_mem_handler.cc, line 272) occurs.
> >
> > Best,
> > Volker
> >
> > > On 22. Oct 2021, at 23:00, Goddard, Julianne <Julianne.Goddard_at_uky.edu> wrote:
> > >
> > > Hello Leonard,
> > >
> > > Thank you for your reply; it is interesting that we are both experiencing the same problem. Yes, mine does seem random; there does not appear to be any pattern to the occurrences at all.
> > >
> > > The one thing I will mention is that I experienced almost the same error when I ran the same simulation with cooling turned off, except that it came from the function deal_with_gravity_node_request rather than the SPH one. This was before I implemented Grackle in the code. Since implementing Grackle I have had no issue running the simulation with cooling turned off (I don't know why this should be; again, it seems random).
> > >
> > > Sincerely,
> > > Julianne
> > >
> > >> On Oct 22, 2021, at 4:41 PM, Leonard Romano <leonard.romano_at_tum.de> wrote:
> > >>
> > >> Hello Julianne,
> > >>
> > >> I am also using Grackle for cooling, and when I enable star formation I encounter the same error. What bugged me the most is that it seems to happen at random, i.e. sometimes after a few stars have spawned and sometimes only after hundreds or thousands have spawned.
> > >> Does your error occur at random too?
> > >> Unfortunately I have not had time to debug this problem yet, so if you or anyone else has any ideas, they would be very welcome.
> > >> Though, needless to say, it seems very likely that these kinds of issues are related to our custom implementations of this sub-grid physics (Grackle is not part of the public Gadget code), so most likely we will each have to find our own solutions to the bugs in our own code...
> > >>
> > >> Best,
> > >> Leonard
> > >>
> > >>
> > >> On 22.10.21 22:14, Goddard, Julianne wrote:
> > >>> Hello Everyone,
> > >>>
> > >>> I am running a zoom-in cosmological simulation with periodic boundary conditions in Gadget4. I am using Grackle for cooling, and star formation is enabled. The zoom region in the simulation is about 1.5 Mpc in radius, and the effective resolution there is 1024^3. I have found that the code runs to completion if I run on only one node; however, if I increase to two or more nodes I start to get one of the following errors:
> > >>>
> > >>> "Code termination on task=91, function deal_with_sph_node_request(), file src/mpi_utils/shared_mem_handler.cc, line 272: p=1564695652 MaxPart=5869 MaxNodes=13117"
> > >>>
> > >>> or
> > >>>
> > >>> "Fatal error in PMPI_Recv: Unknown error class, error stack:
> > >>> PMPI_Recv(171)........................: MPI_Recv(buf=0x7f63546475c0, count=8, MPI_BYTE, src=31, tag=10, MPI_COMM_WORLD, status=0x1) failed
> > >>> MPIDU_Complete_posted_with_error(1137): Process failed"
> > >>>
> > >>> I once had the code complete a run in parallel without experiencing these errors, but since then I have not been able to replicate this. Has anyone else experienced this type of error, or does anyone have advice on how to fix the problem?
> > >>>
> > >>> Thank You,
> > >>>
> > >>> Julianne
> > >>>
> > >>>
> > >> --
> > >> ===================================================
> > >> Leonard Romano, B.Sc.(レオナルド・ロマノ)
> > >> Physics Department
> > >> Technical University of Munich (TUM), Germany
> > >> Theoretical Astrophysics Group
> > >> Department of Earth and Space Science
> > >> Graduate School of Science, Osaka University, Japan
> > >> he / him / his
> > >> ===================================================




-----------------------------------------------------------

If you wish to unsubscribe from this mailing, send mail to
minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
A web-archive of this mailing list is available here:
http://www.mpa-garching.mpg.de/gadget/gadget-list