Re: Segmentation fault with GADGET4 on multiple nodes

From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Date: Sun, 7 Feb 2021 11:05:36 +0100

Dear Balázs,

Thanks a lot for the info. So far, I have not been able to reproduce any of the crashes you experienced on our systems. I've also tried an older version of OpenMPI, 3.1.2, but it still worked for me, for example when running on 2 nodes with 18 cores each.

Looking at the beginnings of the log files you've sent, however, I noticed a few strange things.

For example, the times reported for the line

INIT: success. took=2.01971 sec

or the line

DOMAIN: particle exchange done. (took 5.30099 sec)

should be about ~100 times smaller. Both of these times are sensitive to the communication bandwidth - somehow you had a very slow connection in these tests. Are you perhaps running over an Ethernet connection only? (Then the cluster wouldn't really be usable in practice for multi-node runs.) You can check with "orte-info" how OpenMPI was compiled, and what transport layer it supports - "MCA btl: openib" should be in there.
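
For a quick check, something like the following (run on one of the compute nodes; I'm assuming the orte-info binary of your OpenMPI module is in your PATH) should list the compiled-in transport components:

   orte-info | grep "MCA btl"

If only the tcp, self and vader (shared memory) components appear and openib is missing, traffic between nodes will indeed go over Ethernet only.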

Also, you can enable ENABLE_HEALTHTEST in Gadget4. At start-up it will then report the MPI communication speed that is reached between nodes. A reasonable output for a two-node run would, for example, look like this:

HEALTHTEST: Internode cube: 11727.4 MB/s per pair 1.169% variation | Best=11796.3 on Task=0/Node=0, Worst=11659.3 on Task=17/Node=1, test took 0.00361688 sec
HEALTHTEST: Intranode cube, 1st node: 7466.2 MB/s per pair 38.466% variation | Best=9584.84 on Task=16/Node=0, Worst=6712.89 on Task=12/Node=0, test took 0.00376691 sec
HEALTHTEST: Iprobe for any message: 3.42673e-08 s per MPI_Ip 29.467% variation | Best=3.26881e-08 on Task=3/Node=freya107, Worst=4.27858e-08 on Task=14/Node=freya107, test took 0.034106 sec

In addition, you can enable DEBUG to (hopefully) get a core dump for your crash. You can load the core dump post mortem with a debugger and issue the "bt" backtrace command to get an idea of the function and line where the crash occurred. This would evidently be helpful to know.
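
A minimal sketch of that post-mortem workflow (the executable name Gadget4 is an assumption, and the core file may carry a suffix such as core.<pid> depending on your system settings):

   ulimit -c unlimited    # allow core dumps to be written, e.g. in your Slurm job script
   gdb ./Gadget4 core     # load the executable together with the core file after the crash
   (gdb) bt               # print the backtrace of the crashed process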

Finally, to make different reruns behave identically at the binary level, you can activate PRESERVE_SHMEM_BINARY_INVARIANCE. In this case the calculations remain deterministic at the level of floating-point round-off, and code crashes should then normally be exactly reproducible in reruns if there is a code bug (at least for most types of bugs). If they are not reproducible, then it could well be a problem with your particular cluster or software setup/configuration.
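
In terms of Config.sh, the three suggestions above amount to adding lines such as the following (just a sketch; merge them into the Config.sh you use for the colliding-galaxies example and recompile):

   ENABLE_HEALTHTEST                  # report the MPI communication speed between/within nodes at start-up
   DEBUG                              # run such that a crash should (hopefully) leave a core dump
   PRESERVE_SHMEM_BINARY_INVARIANCE   # keep reruns deterministic at floating-point round-off, so code bugs crash reproducibly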

Regards,
Volker





> On 4. Feb 2021, at 21:27, Balázs Pál <masterdesky_at_gmail.com> wrote:
>
> Dear Volker,
>
> Yes, to be sure I'm sending the first 200 rows of the logs for these two log files (named `log1` and `log2`, corresponding to the `log_tail` and `log_tail2` files). I'm also attaching the head and tail of two more, similar log files. Extra info: the nodes of this cluster contain 18 CPUs each.
>
> Regards,
> Balázs
>
> On Wed, 3 Feb 2021 at 13:00, Volker Springel <vspringel_at_mpa-garching.mpg.de> wrote:
>
> Dear Balázs,
>
> Could you please also send the beginning of these log files?
>
> Thanks,
> Volker
>
>
> > On 3. Feb 2021, at 12:03, Balázs Pál <masterdesky_at_gmail.com> wrote:
> >
> > Dear list members,
> >
> > As a new GADGET4 user, I've encountered an as-yet unsolved problem while testing GADGET4 on multiple nodes of my university's new HPC cluster, which is controlled by Slurm. I've seen a similar (newly posted) issue on this mailing list, but I can't confirm whether both issues have the same origin.
> > I'm trying to run the "colliding galaxies" example using the provided Config.sh and parameter file with OpenMPI 3.1.3. I've built G4 with gcc 8.3.0.
> >
> > Usually what happens is that the simulation starts running normally, but after some time (sometimes minutes, sometimes only after hours) it crashes with a segmentation fault. I also can't confirm whether this crash is consistent or not. Most of the time I've run GADGET4 on 4 nodes with 8 CPUs each, and it crashed similarly approximately 4-5 hours after the start.
> >
> > Extra info:
> > I've attached two separate files, containing the last iteration of two simulations before a crash. The file `log_tail.log` contains the usual crash, which I've encountered every single time. The `log_tail2.log` contains a possibly useful anomaly, where GADGET4 seems to terminate because of some failure in its shared memory handler.
> >
> > I would appreciate it very much if you could give any insight or advice on how to eliminate this problem! If you require any further information, please let me know.
> >
> > Best Regards,
> > Balázs
> > <log_tail2.log><log_tail.log>
>
>
>
>
> <log3_tail_100.log><log1_head_200.log><log3_head_200.log><log4_head_200.log><log2_head_200.log><log4_tail_100.log>