Re: Error from function deal_with_sph_node_request

From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Date: Mon, 25 Oct 2021 18:32:27 +0200

Hi Julianne and Leonard,

I have no clear answer to this problem as I yet have to run into it myself. Random occurences of the issue in multi-node configurations only in the SPH communication routines suggest that it may have something to do with the way the spin-locks in the neighbor tree are handled by the code, through the calls access.test_and_set() and access.clear(). This could either be because of a semantic error in the way the code is doing this (hopefully not), or it could be because the compiler is not respecting all aspects of the (still fairly recent) C++ memory model for concurrency correctly. Try to use a different C++ compiler, and/or a more recent version, to test for the latter.

Occasional MPI errors could in principle also be an issue; I doubt that this is the cause for this issue here, but I would nevertheless suggest to try another MPI library as well (I recently had good experiences with OpenMPI, which tends to be quite stable for 4.x).

Other than that, you can try as an experiment to disable all forward predictions (which also involve spin locks) of SPH neighbor search nodes in the code by forcing a tree construction every step. The simplest (if costly) way to do this would be to set ActivePartFracForNewDomainDecomp=0, which enforces a new domain decomposition every step. It would be interesting to know whether the problem is then still there, or not.

Finally, things could also be related to your own code extensions, for example in the way you implement star formation and/or depletion of gas. For example, modifying MaxPart inconsistently across processors while the neighbor tree is still in use would trigger crashes of the kind you've seen.

To have a chance for making the problem reproducible, I'd recommend to activate PRESERVE_SHMEM_BINARY_INVARIANCE and to make sure that all your grackle routines and SFR routines are binary reproducible when the same random number sequence is ensured. If the problem still persists and is not reproducible, one needs to add more debugging output in case the crash situation (the controlled termination of the code in src/mpi_utils/shared_mem_handler.cc, line 272) should occur.

Best,
Volker

> On 22. Oct 2021, at 23:00, Goddard, Julianne <Julianne.Goddard_at_uky.edu> wrote:
>
> Hello Leonard,
>
> Thank you for your reply, it is interesting that we are both experiencing the same problems. Yes, mine do seem random, there does not appear to be a pattern at all to the occurrence.
>
> The one thing I will mention is that I have experienced almost the same error when I ran the same simulation with cooling turned off, however then it was function deal_with_gravity_node_request rather than the sph. This was before I implemented grackle into the code. Since implementing grackle I have had no issue running the simulation with cooling turned off (I don’t know why this should happen, again it seems random).
>
> Sincerely,
> Julianne
>
>> On Oct 22, 2021, at 4:41 PM, Leonard Romano <leonard.romano_at_tum.de> wrote:
>>
>> 
>> CAUTION: External Sender
>>
>> Hello Julianne,
>>
>> I am also using Grackle for cooling and when I enable star formation, I encounter the same error. What bugged me the most is that it seems to happen at random, i.e. sometimes after few stars have spawned and sometimes only after hundreds or thousands have spawned.
>> Does your error occur at random too?
>> Unfortunately I did not have time to debug this problem yet, so if you or anyone has any ideas, it would be very welcome.
>> Though needless to say it seems very likely that these kinds of issues are related to our custom implementations of these sub grid physics (Grackle is not part of the public Gadget code), so most likely we will have to find our own solutions to the bugs in our own code...
>>
>> Best,
>> Leonard
>>
>>
>> On 22.10.21 22:14, Goddard, Julianne wrote:
>>> Hello Everyone,
>>>
>>> I am running a zoom-in cosmological simulation with periodic boundary conditions in Gadget4. I am using grackle for cooling and star formation is enabled. The zoom region in the simulation is about 1.5 Mpc in radius, and the effective resolution here is 1024^3. I have found that the code runs to completion if I run on only one node, however if I increase to two or more nodes I start to get one of the following errors:
>>>
>>> "Code termination on task=91, function deal_with_sph_node_request(), file src/mpi_utils/shared_mem_handler.cc, line 272: p=1564695652 MaxPart=5869 MaxNodes=13117"
>>>
>>> or
>>>
>>> "Fatal error in PMPI_Recv: Unknown error class, error stack:
>>> PMPI_Recv(171)........................: MPI_Recv(buf=0x7f63546475c0, count=8, MPI_BYTE, src=31, tag=10, MPI_COMM_WORLD, status=0x1) failed
>>> MPIDU_Complete_posted_with_error(1137): Process failed"
>>>
>>> I have once had the code complete running in parallel without experiencing these errors, but since I have not been able to replicate. Has anyone else experienced this type of error or have advice on how to fix the problem?
>>>
>>> Thank You,
>>>
>>> Julianne
>>>
>>>
>> --
>> ===================================================
>> Leonard Romano, B.Sc.(レオナルド・ロマノ)
>> Physics Department
>> Technical University of Munich (TUM), Germany
>> Theoretical Astrophysics Group
>> Department of Earth and Space Science
>> Graduate School of Science, Osaka University, Japan
>> he / him / his
>> ===================================================
>>
>>
>> -----------------------------------------------------------
>>
>> If you wish to unsubscribe from this mailing, send mail to
>> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
>> A web-archive of this mailing list is available here:
>> http://www.mpa-garching.mpg.de/gadget/gadget-list
>
> -----------------------------------------------------------
>
> If you wish to unsubscribe from this mailing, send mail to
> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
> A web-archive of this mailing list is available here:
> http://www.mpa-garching.mpg.de/gadget/gadget-list
Received on 2021-10-25 18:32:27

This archive was generated by hypermail 2.3.0 : 2022-09-01 14:03:43 CEST