Re: Error from function deal_with_sph_node_request

From: Goddard, Julianne <Julianne.Goddard_at_uky.edu>
Date: Sat, 15 Jan 2022 01:45:15 +0000

Hello,

I have just a quick update on this issue regarding running on other clusters. Without making any changes to Gadget-4, I have run the same simulation on the AMD cluster mentioned in my last email and on Stampede2 at TACC. On our AMD cluster the simulation ran to completion in parallel; however, it did still crash twice in the function deal_with_sph_node_request(), and after being restarted it passed the error both times.

On Stampede2 I have experienced almost the same thing: the simulation ran until z=2.7 and then crashed with the same error. I have restarted it and am waiting to see whether it can also pass the error on this cluster.

Best,
Julianne

________________________________
From: Goddard, Julianne <Julianne.Goddard_at_uky.edu>
Sent: Thursday, January 6, 2022 7:37 PM
To: Gadget General Discussion <gadget-list_at_MPA-Garching.MPG.DE>
Subject: Re: [gadget-list] Error from function deal_with_sph_node_request

Hello Volker,

Thank you for your reply and continued help. There is a tmpfs filesystem mounted on /dev/shm; could this be the source of the error? Since my last email I have also been able to run on an AMD cluster here at our university and have found that, while I do still get the same error, the simulation runs much further, to about z=1.7 so far. I have restarted and am waiting to see if it will run farther.

I have communicated with the system administrator, and he has said that you can access the cluster through XSEDE if you provide him with your login ID and DN. I will send you the details and his contact information privately. Please do let me know if I can provide anything further, I greatly appreciate your patience and dedication in helping me figure out this issue.

Sincerely,
Julianne
________________________________
From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Sent: Thursday, January 6, 2022 11:55 AM
To: Gadget General Discussion <gadget-list_at_MPA-Garching.MPG.DE>
Subject: Re: [gadget-list] Error from function deal_with_sph_node_request


Dear Julianne,

Thanks a lot for compiling the information I had suggested gathering. I agree with your assessment. The software environment and library look healthy and reasonable to me. And the three clusters you have tried at your institute look fine and have a modern InfiniBand network, again nothing particularly special. In principle gadget4 should run on these systems without the trouble you experienced. Alas, I'm still unable to reproduce the error you encountered on the systems I have access to. The fact that gadget3 and gizmo run fine on your systems suggests that it is likely some small peculiarity in the software environment or setup of your cluster that causes the instability in gadget4.

One possibility would perhaps be the way shared memory is set up on the cluster. When you run "df -k", do you see a tmpfs filesystem mounted on /dev/shm? This would be normal, but it does not necessarily have to be there.
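For anyone following the thread, the check described above can be done as follows (a minimal sketch; the exact output format varies by system):

```shell
# Check whether a tmpfs filesystem is mounted on /dev/shm.
# Gadget-4's intra-node communication uses POSIX shared memory, which
# the Linux kernel backs with this mount on most distributions.
df -k /dev/shm

# The mount table shows the filesystem type explicitly:
grep /dev/shm /proc/mounts
```

If the second command prints a line starting with "tmpfs", the mount is present in the usual configuration.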

Other than that, I'm somewhat out of ideas and would have to try finding the cause myself. Perhaps you can ask your system managers if a test account for me could be created on your system. In this case, I'd be happy to try to track this down.

Best,
Volker


> On 22. Dec 2021, at 15:01, Goddard, Julianne <Julianne.Goddard_at_uky.edu> wrote:
>
> Hello Volker,
>
> Thank you for your suggestions. The system I am running on uses an Infiniband network and I experience the same issue on all three of the clusters I have used on our system. Other members of my group have used these same clusters to successfully run similar simulations on Gadget3 and Enzo, and the same simulation has run successfully on this system using GIZMO.
>
> I used the "ldd" and "orte-info" commands as you suggested and have included the output, along with some details about the clusters I run on, in the document attached to this link (it was too large to include within the email). As far as I can tell the correct MPI libraries are being called.
>
> https://drive.google.com/file/d/1ucK4yMf-Rj3BbMAZKGvZCDRgpQCVaoEp/view?usp=sharing
>
>
> A colleague ran my simulation (using the unedited version of Gadget4) on a different Intel cluster using 3 nodes and did not experience the same type of error (though he did hit a timestepping error, which he was able to avoid by decreasing MaxSizeTimestep). Combined with your results, this leads me to strongly believe that this is a cluster-specific issue. As for running on another cluster myself, I have been hoping to request time on another system through XSEDE. I had been waiting to apply until I was sure the code was working, but perhaps it would be worthwhile to apply now just to test this issue.
>
> With gratitude for your time,
> Julianne
>
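The library check mentioned in the quoted message can be scripted; a sketch, assuming the compiled binary is named Gadget4 in the current directory (adjust the path to your own build):

```shell
# Print the shared libraries the executable resolves at run time and
# filter for the MPI stack; an unexpected or mismatched library here
# is a common cause of cluster-specific MPI failures.
ldd ./Gadget4 | grep -i -E 'mpi|ucx|fabric'
```

Comparing this output across the clusters where the code does and does not crash can help spot an environment difference.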

Received on 2022-01-15 02:45:33

This archive was generated by hypermail 2.3.0 : 2022-09-01 14:03:43 CEST