Re: Error from function deal_with_sph_node_request

From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Date: Wed, 3 Nov 2021 08:13:23 +0000

Hi Julianne,

Thanks for patiently doing all these suggested tests. While they haven't fixed the problem, they at least confirmed that the error appears to be related to the communication routines in deal_with_sph_node_request(), and apparently even in deal_with_gravity_node_request(). This is somewhat unexpected on my end as I have not seen this error myself so far - but that doesn't mean that everything is necessarily correct. The non-reproducibility and rareness of your crashes suggest that there could be a subtle race condition in the communication routines that has not yet been recognized, and for some reason occurs on your machine, but not on the computers I had a chance to test on.

It would be helpful if you could make the exact code you are running, the configuration+parameter files, the ICs, and a couple of stdout log files of the crashes available to me for download somewhere. I can then look into it somewhat more.

Regards,
Volker


> On 2. Nov 2021, at 21:04, Goddard, Julianne <Julianne.Goddard_at_uky.edu> wrote:
>
> Dear Volker and Leonard and other list subscribers,
>
> Thank you very much for your time and suggestions. I am writing to report the results of implementing these suggestions in my simulation.
>
> I first enabled PRESERVE_SHMEM_BINARY_INVARIANCE to test whether this creates a reproducible problem. I changed nothing else and ran my simulation twice, and found that I did get the same error in both runs; however, in the first it occurred around z=11 and in the second around z=5 (in both cases the simulation was started at z=99). I have since left this setting active, but it does not seem to make a difference to the seemingly random occurrences of the error.
>
> I next set ActivePartFracForNewDomainDecomp=0 without any other changes and found that I still get the same error along with program termination (this time occurring around z=12). In this and all other cases, if I restart the simulation it will run a bit farther, but inevitably the same termination and error message occur at some later redshift.
>
> Finally, I tried changing the combination of compiler and MPI library: intel18.0.3.222/impi2018.3.222, gnu8.3.0/mpich3.3, gnu8.3.0/openmpi3.1.4, and gnu8.3.0/openmpi4.1.1. With each of these the error looks different (for example, with OpenMPI 3 the error is reported as Segmentation Fault (11)), but if I analyze the core dump with gdb I can see that in every case the function the simulation was working on when it crashed was either deal_with_sph_node_request() or deal_with_gravity_node_request().
>
> I have found that if I run the simulation with unaltered Gadget4 I still get the same error, so I do not think it is solely due to the grackle cooling, though that could be contributing. As before, if I run on only one node there is no problem; the error only occurs when running in parallel across multiple nodes.
>
> I am now working on getting output from just before the crash to see what is happening and will update again with those results.
>
> Volker, I think I have now tried all of the suggestions provided except this final step of looking right at the moment of the crash. Do you have any additional suggestions or insights based on this new information, or on how best to look at the data right at the moment of the error?
>
> Thank you again for your time,
> Julianne
> From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
> Sent: Monday, October 25, 2021 12:32 PM
> To: Gadget General Discussion <gadget-list_at_MPA-Garching.MPG.DE>
> Subject: Re: [gadget-list] Error from function deal_with_sph_node_request
>
>
> Hi Julianne and Leonard,
>
> I have no clear answer to this problem, as I have yet to run into it myself. The random occurrence of the issue only in multi-node configurations and only in the SPH communication routines suggests that it may have something to do with the way the spin-locks in the neighbor tree are handled by the code, through the calls access.test_and_set() and access.clear(). This could either be because of a semantic error in the way the code is doing this (hopefully not), or because the compiler is not respecting all aspects of the (still fairly recent) C++ memory model for concurrency correctly. Try to use a different C++ compiler, and/or a more recent version, to test for the latter.
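> For illustration, the pattern in question is essentially the usual atomic_flag spin-lock idiom, roughly like the following (a minimal sketch only, not the actual Gadget-4 code; only the member name 'access' and the two calls are taken from the code, the surrounding names are illustrative):
>
>   #include <atomic>
>
>   // one flag per tree node; initially clear, i.e. unlocked
>   std::atomic_flag access = ATOMIC_FLAG_INIT;
>
>   void lock_node()
>   {
>     // spin until test_and_set() returns false, i.e. until the flag was
>     // previously clear and it is us who set it
>     while(access.test_and_set(std::memory_order_acquire))
>       ;
>   }
>
>   void unlock_node()
>   {
>     // make all writes done while holding the lock visible before releasing it
>     access.clear(std::memory_order_release);
>   }
>
> A compiler or standard library that gets the ordering guarantees of these operations wrong could let another task see a node in a half-updated state, which would produce exactly this kind of rare, non-reproducible failure.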
>
> Occasional MPI errors could in principle also be an issue; I doubt that this is the cause of the problem here, but I would nevertheless suggest trying another MPI library as well (I recently had good experiences with OpenMPI, which tends to be quite stable in its 4.x versions).
>
> Other than that, as an experiment you can try to disable all forward predictions of SPH neighbor search nodes in the code (which also involve spin-locks) by forcing a tree construction every step. The simplest (if costly) way to do this would be to set ActivePartFracForNewDomainDecomp=0, which enforces a new domain decomposition every step. It would be interesting to know whether the problem is then still there or not.
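> (In the parameter file this is just the usual "name value" pair, i.e. a line like
>
>   ActivePartFracForNewDomainDecomp   0
>
> assuming your file otherwise follows the standard Gadget-4 parameter format.)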
>
> Finally, things could also be related to your own code extensions, for example in the way you implement star formation and/or depletion of gas. Modifying MaxPart inconsistently across processors while the neighbor tree is still in use, for instance, would trigger crashes of the kind you've seen.
>
> To have a chance of making the problem reproducible, I'd recommend activating PRESERVE_SHMEM_BINARY_INVARIANCE and making sure that all your grackle and SFR routines are binary reproducible when the same random number sequence is ensured. If the problem still persists and is not reproducible, one needs to add more debugging output in case the crash situation (the controlled termination of the code in src/mpi_utils/shared_mem_handler.cc, line 272) should occur.
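> As a rough sketch of what I mean by more debugging output (the exact variables available around line 272 of src/mpi_utils/shared_mem_handler.cc will differ; 'source' below is only a placeholder for whatever identifies the requesting task at that point), one could extend the existing termination call with additional context, e.g.
>
>   // hypothetical extension of the controlled termination: also report
>   // who sent the offending request, in addition to p, MaxPart, MaxNodes
>   Terminate("p=%d MaxPart=%d MaxNodes=%d requesting-task=%d",
>             p, MaxPart, MaxNodes, source);
>
> so that one can at least see which task pair is involved when it happens.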
>
> Best,
> Volker
>
> > On 22. Oct 2021, at 23:00, Goddard, Julianne <Julianne.Goddard_at_uky.edu> wrote:
> >
> > Hello Leonard,
> >
> > Thank you for your reply, it is interesting that we are both experiencing the same problems. Yes, mine do seem random, there does not appear to be a pattern at all to the occurrence.
> >
> > The one thing I will mention is that I experienced almost the same error when I ran the same simulation with cooling turned off, except that then it came from the function deal_with_gravity_node_request() rather than the SPH one. This was before I implemented grackle into the code. Since implementing grackle I have had no issue running the simulation with cooling turned off (I don't know why this should be; again, it seems random).
> >
> > Sincerely,
> > Julianne
> >
> >> On Oct 22, 2021, at 4:41 PM, Leonard Romano <leonard.romano_at_tum.de> wrote:
> >>
> >> Hello Julianne,
> >>
> >> I am also using Grackle for cooling, and when I enable star formation I encounter the same error. What bugged me the most is that it seems to happen at random, i.e. sometimes after a few stars have spawned and sometimes only after hundreds or thousands have spawned.
> >> Does your error occur at random too?
> >> Unfortunately I did not have time to debug this problem yet, so if you or anyone has any ideas, it would be very welcome.
> >> Needless to say, it seems very likely that these kinds of issues are related to our custom implementations of this sub-grid physics (Grackle is not part of the public Gadget code), so most likely we will each have to find our own solutions to the bugs in our own code...
> >>
> >> Best,
> >> Leonard
> >>
> >>
> >> On 22.10.21 22:14, Goddard, Julianne wrote:
> >>> Hello Everyone,
> >>>
> >>> I am running a zoom-in cosmological simulation with periodic boundary conditions in Gadget4. I am using grackle for cooling, and star formation is enabled. The zoom region in the simulation is about 1.5 Mpc in radius, and the effective resolution there is 1024^3. I have found that the code runs to completion if I run on only one node; however, if I increase to two or more nodes I start to get one of the following errors:
> >>>
> >>> "Code termination on task=91, function deal_with_sph_node_request(), file src/mpi_utils/shared_mem_handler.cc, line 272: p=1564695652 MaxPart=5869 MaxNodes=13117"
> >>>
> >>> or
> >>>
> >>> "Fatal error in PMPI_Recv: Unknown error class, error stack:
> >>> PMPI_Recv(171)........................: MPI_Recv(buf=0x7f63546475c0, count=8, MPI_BYTE, src=31, tag=10, MPI_COMM_WORLD, status=0x1) failed
> >>> MPIDU_Complete_posted_with_error(1137): Process failed"
> >>>
> >>> I once had the code complete a run in parallel without experiencing these errors, but since then I have not been able to replicate this. Has anyone else experienced this type of error, or does anyone have advice on how to fix the problem?
> >>>
> >>> Thank You,
> >>>
> >>> Julianne
> >>>
> >>>
> >> --
> >> ===================================================
> >> Leonard Romano, B.Sc.(レオナルド・ロマノ)
> >> Physics Department
> >> Technical University of Munich (TUM), Germany
> >> Theoretical Astrophysics Group
> >> Department of Earth and Space Science
> >> Graduate School of Science, Osaka University, Japan
> >> he / him / his
> >> ===================================================
> >>
> >>
> >
>
>
>
>
> -----------------------------------------------------------
>
> If you wish to unsubscribe from this mailing, send mail to
> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
> A web-archive of this mailing list is available here:
> http://www.mpa-garching.mpg.de/gadget/gadget-list
Received on 2021-11-03 09:13:24
