Re: Segmentation fault with GADGET4 on multiple nodes

From: Balázs Pál <masterdesky_at_gmail.com>
Date: Wed, 10 Feb 2021 17:07:06 +0100

Dear Volker,

Thank you for your reply and your detailed insights and tips! Based on this
information, I've also contacted the cluster admins about the possible
server-side problems. As it turned out, they are currently investigating
very similar errors with other MPI applications. They also said that the
default connection type had been switched to Ethernet during debugging runs
over the past 1.5 weeks, which explains why GADGET4 logged such high times
in some cases, as you pointed out.

For now, I'll wait for them to sort out the MPI debugging and
configuration, and only then proceed with some tests based on your tips to
see whether my problem persists. Again, I'm grateful for your insights and
advice!

Best regards,
Balázs

On Sun, 7 Feb 2021 at 11:06, Volker Springel <vspringel_at_mpa-garching.mpg.de>
wrote:

>
> Dear Balázs,
>
> Thanks a lot for the info. So far, I have not been able to reproduce any
> of the crashes you experienced on our systems. I've also tried an older
> version of OpenMPI, 3.1.2, but it still worked for me, for example when
> running on 2 nodes with 18 cores each.
>
> In looking at the beginnings of the log files you've sent, I noticed a few
> strange things, however.
>
> For example, the times reported for the line
>
> INIT: success. took=2.01971 sec
>
> or the line
>
> DOMAIN: particle exchange done. (took 5.30099 sec)
>
> should be roughly 100 times smaller. Both of these times are sensitive to
> the communication bandwidth - somehow you had a very slow connection in
> these tests. Are you perhaps running over an Ethernet connection only?
> (Then the cluster wouldn't really be usable in practice for multi-node
> runs.) You can check with "orte-info" how OpenMPI was compiled, and which
> transport layers it supports - "MCA btl: openib" should be in there.
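> As a sketch, such a check could look like the following (the exact
> component names depend on how OpenMPI was built on your cluster, so
> treat "openib" here as an example):

```shell
# List the byte-transfer-layer (BTL) components this OpenMPI build supports.
# Seeing "MCA btl: openib" indicates InfiniBand transport is available;
# seeing only "tcp" and "self" suggests the library can use Ethernet only.
orte-info | grep "MCA btl"
```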
>
> Also, you can enable ENABLE_HEALTHTEST in Gadget4. At startup it will then
> report the MPI communication speed reached between nodes. A reasonable
> output for a two-node run would, for example, look like this:
>
> HEALTHTEST: Internode cube: 11727.4 MB/s per pair 1.169% variation | Best=11796.3 on Task=0/Node=0, Worst=11659.3 on Task=17/Node=1, test took 0.00361688 sec
> HEALTHTEST: Intranode cube, 1st node: 7466.2 MB/s per pair 38.466% variation | Best=9584.84 on Task=16/Node=0, Worst=6712.89 on Task=12/Node=0, test took 0.00376691 sec
> HEALTHTEST: Iprobe for any message: 3.42673e-08 s per MPI_Ip 29.467% variation | Best=3.26881e-08 on Task=3/Node=freya107, Worst=4.27858e-08 on Task=14/Node=freya107, test took 0.034106 sec
>
> Finally, you can enable DEBUG to (hopefully) get a core dump for your
> crash. You can load the core dump post mortem with a debugger and issue the
> "bt" backtrace command to get an idea about the function and line where you
> experienced the crash. This would evidently be helpful to know.
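> In case it helps, a minimal post-mortem session could look like this
> (the binary and core-file names are examples and will differ on your
> system):

```shell
# Make sure core dumps are not suppressed in the shell that launches the run:
ulimit -c unlimited

# After a crash, open the core file together with the matching executable
# and print the backtrace; "bt" shows the function and line of the crash.
gdb ./Gadget4 core --batch -ex bt
```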
>
> To make different reruns behave identically at the binary level, you can
> activate PRESERVE_SHMEM_BINARY_INVARIANCE. In this case the calculations
> remain deterministic at the level of floating-point round-off, and code
> crashes should then normally be exactly reproducible in reruns if there is
> a code bug (at least for most types of bugs). If they are not reproducible,
> then it could indeed well be a problem with your particular cluster or
> software setup/configuration.
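> For reference, the debugging switches mentioned above are compile-time
> options, so they would go into Config.sh before rebuilding Gadget4
> (option names as used in the Gadget4 documentation; please verify them
> against your code version):

```shell
# Config.sh additions for debugging (rebuild Gadget4 afterwards):
ENABLE_HEALTHTEST                  # report inter-/intra-node MPI bandwidth at startup
DEBUG                              # allow core dumps on crashes for post-mortem analysis
PRESERVE_SHMEM_BINARY_INVARIANCE   # make reruns deterministic at round-off level
```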
>
> Regards,
> Volker
>
>
>
>
>
> > On 4. Feb 2021, at 21:27, Balázs Pál <masterdesky_at_gmail.com> wrote:
> >
> > Dear Volker,
> >
> > Yes, I'm sending the first 200 lines of the logs, just to be sure, for
> these two log files (named `log1` and `log2` respectively for the
> `log_tail` and `log_tail2` files). I'm also attaching the head and tail of
> two more, similar log files. Extra info: each node of this cluster
> contains 18 CPUs.
> >
> > Regards,
> > Balázs
> >
> > On Wed, 3 Feb 2021 at 13:00, Volker Springel <
> vspringel_at_mpa-garching.mpg.de> wrote:
> >
> > Dear Balázs,
> >
> > Could you please also send the beginning of these log files?
> >
> > Thanks,
> > Volker
> >
> >
> > > On 3. Feb 2021, at 12:03, Balázs Pál <masterdesky_at_gmail.com> wrote:
> > >
> > > Dear list members,
> > >
> > > As a new GADGET4 user, I've encountered an as-yet-unsolved problem
> while testing GADGET4 on my university's new HPC cluster on multiple
> nodes, controlled by Slurm. I've seen a similar (newly posted) issue on
> this mailing list, but I can't confirm whether both issues have the same
> origin.
> > > I'm trying to run the "colliding galaxies" example using the provided
> Config.sh and parameter file with OpenMPI 3.1.3. I've built G4 with gcc
> 8.3.0.
> > >
> > > Usually what happens is that the simulation starts running normally,
> but after some time (sometimes minutes, sometimes only after hours) it
> crashes with a segmentation fault. I also can't confirm whether this crash
> is consistent. I've mostly tried to run GADGET4 on 4 nodes with 8 CPUs
> each, and it crashed similarly approximately 4-5 hours after the start.
> > >
> > > Extra info:
> > > I've attached two separate files containing the last iteration of two
> simulations before a crash. The file `log_tail.log` contains the usual
> crash, which I've encountered every single time. The `log_tail2.log` file
> contains a "maybe useful anomaly", where GADGET4 seems to terminate
> because of some failure in its shared-memory handler.
> > >
> > > I would appreciate it very much if you could give any insight or
> advice on how to eliminate this problem! If you require any further
> information, please let me know.
> > >
> > > Best Regards,
> > > Balázs
> > > <log_tail2.log><log_tail.log>
> > > -----------------------------------------------------------
> > >
> > > If you wish to unsubscribe from this mailing, send mail to
> > > minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe
> gadget-list
> > > A web-archive of this mailing list is available here:
> > > http://www.mpa-garching.mpg.de/gadget/gadget-list
> >
> >
> >
> >
> >
> <log3_tail_100.log><log1_head_200.log><log3_head_200.log><log4_head_200.log><log2_head_200.log><log4_tail_100.log>
>
>
>
>
>
Received on 2021-02-10 17:07:27

This archive was generated by hypermail 2.3.0 : 2022-09-01 14:03:43 CEST