Hi Julianne,
Thanks for giving me access to your cluster. There, I encountered the same issue: random crashes in the routine deal_with_sph_node_request(), or sometimes in the routine deal_with_gravity_node_request(), after running on multiple nodes for a couple of hours. None of these crashes was ever reproducible for me. Also, I could not get rid of this with MPI library or compiler tweaks, consistent with your findings. But I never got this issue on any of the other machines I have access to.
After an extensive search, I have now been able to identify the place where the problem originates, and I have found a simple work-around.
Before parallel tree-based calculations in Gadget4 are done, the MPI rank executing the routine shared_memory_handler() receives offset tables from the other MPI ranks that reside on the same shared-memory node, which then facilitates direct access to their tree data. This happens via several calls of the function prepare_offset_table(), which in turn makes use of an MPI_Gather call.
MPI_Gather is a blocking collective function according to the MPI standard, i.e. on the target rank of the gather operation (which here is the shared-memory handler process) the function should only return once the data has fully arrived. But for some reason, this is not always the case on your cluster: roughly once per 10^5 executions of the tree access preparation, MPI_Gather returns prematurely, *before* the data is fully assembled in the receive buffer. As a result, deal_with_sph_node_request() seg-faults.
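To make the failure mode concrete, here is a minimal schematic sketch of the pattern (this is not the actual Gadget4 code; the function, communicator, and buffer names below are made up for illustration, and the real gather happens inside prepare_offset_table()):

#include <mpi.h>
#include <vector>

// Schematic sketch only: names are placeholders, not Gadget4 identifiers.
void gather_offsets_sketch(MPI_Comm node_comm, long long my_offset, int handler_rank)
{
  int nranks;
  MPI_Comm_size(node_comm, &nranks);

  // Receive buffer; only meaningfully used on the root
  // (the rank running the shared-memory handler).
  std::vector<long long> offsets(nranks);

  // A blocking collective: per the MPI standard, on the root this call must
  // not return before every rank's contribution has been placed in 'offsets'.
  MPI_Gather(&my_offset, 1, MPI_LONG_LONG,
             offsets.data(), 1, MPI_LONG_LONG,
             handler_rank, node_comm);

  // Failure mode seen on the affected cluster: roughly once per ~10^5 such
  // calls the root continued past this point with an incompletely filled
  // receive buffer, and the subsequent tree access then seg-faulted.
}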
This is a bug on your system in some software or hardware layer below Gadget4. It could be in the InfiniBand driver stack, in the HCA firmware, or possibly even in the Linux kernel, since the shared-memory transport internal to MPI_Gather in this particular situation requires non-trivial locking and completion synchronization. (I note that the Linux kernel your cluster is running is a truly ancient long-term support version from Red Hat: version 3.10.0, first released in 2013. On our clusters we run, for example, 5.3.18.)
I have found that inserting an MPI_Barrier into the code right after the MPI_Gather operations (on the same communicator spanning just the ranks within the shared-memory node) fixes the above problem. The corresponding patch is the commit
https://gitlab.mpcdf.mpg.de/vrs/gadget4/-/commit/7359b7047192d90c784683377d70035f5e669786
in the public repository. If you add this to your code, your problem should be solved. It is plausible that this will also solve things for Xiaolong, who reported running into the same issue.
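Schematically, the change has the following shape (again only an illustrative sketch reusing the placeholder names from the sketch above; the actual code and identifiers are in the linked commit):

  MPI_Gather(&my_offset, 1, MPI_LONG_LONG,
             offsets.data(), 1, MPI_LONG_LONG,
             handler_rank, node_comm);

  // Work-around: synchronize all ranks on the shared-memory node before the
  // handler proceeds, so it cannot act on a partially filled receive buffer.
  MPI_Barrier(node_comm);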
Best regards,
Volker
> On 15. Jan 2022, at 02:45, Goddard, Julianne <Julianne.Goddard_at_uky.edu> wrote:
>
> Hello,
>
> I have just a quick update on this issue regarding running on other clusters. Without making any changes to Gadget-4, I have run the same simulation on the AMD cluster mentioned in my last email and on Stampede-2 at TACC. On our AMD cluster the simulation ran to completion in parallel; however, it did still crash twice in the function deal_with_sph_node_request(), and after being restarted it passed the error both times.
>
> On Stampede2 I have experienced almost the same thing: the simulation ran until z=2.7 and then crashed with the same error. I have restarted it and am waiting to see if it can also pass the error on this cluster.
>
> Best,
> Julianne
>
> From: Goddard, Julianne <Julianne.Goddard_at_uky.edu>
> Sent: Thursday, January 6, 2022 7:37 PM
> To: Gadget General Discussion <gadget-list_at_MPA-Garching.MPG.DE>
> Subject: Re: [gadget-list] Error from function deal_with_sph_node_request
>
> Hello Volker,
>
> Thank you for your reply and continued help. There is a tmpfs filesystem mounted on /dev/shm; could this be the source of the error? I have also been able to run on an AMD cluster here at our University since my last email and have found that, while I do still get the same error, the simulation runs much further, to about z=1.7 so far. I have restarted and am waiting to see if it will run farther.
>
> I have communicated with the system administrator, and he has said that you can access the cluster through XSEDE if you provide him with your login ID and DN. I will send you the details and his contact information privately. Please do let me know if I can provide anything further, I greatly appreciate your patience and dedication in helping me figure out this issue.
>
> Sincerely,
> Julianne
> From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
> Sent: Thursday, January 6, 2022 11:55 AM
> To: Gadget General Discussion <gadget-list_at_MPA-Garching.MPG.DE>
> Subject: Re: [gadget-list] Error from function deal_with_sph_node_request
>
> CAUTION: External Sender
>
>
> Dear Julianne,
>
> Thanks a lot for compiling the information I had suggested gathering. I agree with your assessment. The software environment and libraries look healthy and reasonable to me. And the three clusters you have tried at your institute look fine and have a modern InfiniBand network, again nothing particularly special. In principle gadget4 should run on these systems without the trouble you experienced. Alas, I'm still unable to reproduce the error you encountered on the systems I have access to. The fact that gadget3 and gizmo run fine on your systems suggests that it is likely some small peculiarity in the software environment or setup of your cluster that causes the instability in gadget4.
>
> One possibility would perhaps be the way shared memory is set-up on the cluster. When you do a "df -k", do you see a tmpfs filesystem mounted on /dev/shm? This would be normal, but does not necessarily have to be there.
>
> Other than that, I'm somewhat out of ideas and would have to try finding the cause myself. Perhaps you can ask your system managers if a test account for me could be created on your system. In this case, I'd be happy to try to track this down.
>
> Best,
> Volker
>
>
> > On 22. Dec 2021, at 15:01, Goddard, Julianne <Julianne.Goddard_at_uky.edu> wrote:
> >
> > Hello Volker,
> >
> > Thank you for your suggestions. The system I am running on uses an Infiniband network and I experience the same issue on all three of the clusters I have used on our system. Other members of my group have used these same clusters to successfully run similar simulations on Gadget3 and Enzo, and the same simulation has run successfully on this system using GIZMO.
> >
> > I used the "ldd" and "orte-info" commands as you suggested and have included the output, along with some details about the clusters I run on, in the document attached to this link (it was too large to include within the email). As far as I can tell the correct MPI libraries are being called.
> >
> > https://drive.google.com/file/d/1ucK4yMf-Rj3BbMAZKGvZCDRgpQCVaoEp/view?usp=sharing
> >
> >
> > A colleague ran my simulation (using the unedited version of Gadget4) on a different Intel cluster using 3 nodes and did not experience the same type of error (though he did have a timestepping error, which he was able to avoid by decreasing MaxSizeTimestep). This, combined with your results, leads me to strongly believe that this is a cluster-specific issue. As for running personally on another cluster, I have been hoping to request time on another system through XSEDE, but wanted to wait until I was sure the code was working successfully; maybe it would be worthwhile to apply now just to test this issue.
> >
> > With gratitude for your time,
> > Julianne
> >
>
> >
> >
> >
>
> -----------------------------------------------------------
>
> If you wish to unsubscribe from this mailing, send mail to
> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
> A web-archive of this mailing list is available here:
> http://www.mpa-garching.mpg.de/gadget/gadget-list
Received on 2022-01-22 15:01:09