Re: Segmentation Fault on DMO runs on power9

From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Date: Wed, 3 Mar 2021 14:09:06 +0100

Hi Tiago,

> On 3. Mar 2021, at 12:32, Tiago Castro <tiagobscastro_at_gmail.com> wrote:
>
> Many thanks, Volker.
>
> > Hm, it is possibly a shared-memory access problem, given the place where this happens. Does the code run on a single node? Which MPI library is this? Buggy MPI-3 support is certainly a primary suspect. It's also peculiar that the machine allows only 40% of the physical memory to be allocated as shared memory... (this is not good).
>
> The code also crashed at the same point when run on a single node. The MPI library is the one from IBM (I am running it on the M100 cluster).

Ok, in principle this should be IBM's Spectrum MPI library, which is closely related to OpenMPI. However, on Marconi100, you should be able to use GNU/OpenMPI as an alternative by changing to the corresponding modules. At least on Intel processors, OpenMPI works well for Gadget4.

>
> > You can try to activate DEBUG to see whether this gives a core file for the crash. This would allow you to locate the line where this happens by loading the core file with gdb.
>
> I have asked support for help running this, as I have not used gdb with MPI and batch jobs before. I will get back to you once I manage to run it.
>
> > Another possibility would be to add the attached stack-tracing class to the compiled files for Gadget4. This will activate a signal handler and - if you are moderately lucky - print an informative stack trace when the crash happens.
>
> I apologize for my ignorance, but I did not understand how to implement this.
>

You only need to move backward.cc/backward.h to a source directory (e.g. src/system), and include them in the makefile of Gadget4, like this:

    OBJS += system/pinning.o system/system.o system/backward.o
    INCL += system/system.h system/pinning.h system/backward.h

That's all, the constructor of the class will be called automatically on start-up without needing to modify any of the original code.
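
To illustrate why no change to the original code is needed, here is a minimal sketch of the mechanism. It is not the content of backward.cc; the names crashtrace, crash_handler and CrashHandlerInstaller are hypothetical, and it assumes Linux with glibc (for <execinfo.h>). A global object is constructed before main() starts, so its constructor can install the signal handler simply because the object file is linked in.

```cpp
// crash_handler.cc: hypothetical stand-in for an extra source file such as backward.cc
#include <csignal>
#include <execinfo.h>  // backtrace, backtrace_symbols_fd (glibc-specific, an assumption)
#include <unistd.h>    // write, STDERR_FILENO

namespace crashtrace
{
bool handler_installed = false;

// Called when the process receives one of the registered fatal signals.
void crash_handler(int sig)
{
  static const char msg[] = "fatal signal caught, stack trace follows:\n";
  ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
  (void)ignored;

  // Collect and print the raw return addresses of the current call stack.
  // Caveat: backtrace() is not guaranteed to be async-signal-safe, so this
  // is a best-effort diagnostic only.
  void *frames[64];
  int n = backtrace(frames, 64);
  backtrace_symbols_fd(frames, n, STDERR_FILENO);

  // Restore the default action and re-raise, so that a core file can
  // still be written if the system allows it.
  std::signal(sig, SIG_DFL);
  std::raise(sig);
}

// The constructor registers the handler; nothing else has to call it.
struct CrashHandlerInstaller
{
  CrashHandlerInstaller()
  {
    std::signal(SIGSEGV, crash_handler);
    std::signal(SIGABRT, crash_handler);
    handler_installed = true;
  }
};

// Global instance: constructed during start-up, before main() runs.
CrashHandlerInstaller installer;

}  // namespace crashtrace
```

Compiling the code with debug symbols (-g) and linking with -rdynamic usually makes the printed frames more readable, but how much information you get depends on the compiler and platform.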

Regards,
Volker





> Many thanks!
> Tiago Castro Post Doc, Department of Physics / UNITS / OATS
> Phone: (+39 040 3199 120)
> Mobile: (+39 388 794 1562)
> Email: tiagobscastro_at_gmail.com
> Website: tiagobscastro.com
> Skype: tiagobscastro
> Address: Osservatorio Astronomico di Trieste / Villa Bazzoni
> Via Bazzoni, 2, 34143 Trieste TS
>
>
>
>
> On Thu, 25 Feb 2021 at 15:59, Volker Springel <vspringel_at_mpa-garching.mpg.de> wrote:
> Hi Tiago,
>
> Hm, it is possibly a shared-memory access problem, given the place where this happens. Does the code run on a single node? Which MPI library is this? Buggy MPI-3 support is certainly a primary suspect. It's also peculiar that the machine allows only 40% of the physical memory to be allocated as shared memory... (this is not good).
>
> You can try to activate DEBUG to see whether this gives a core file for the crash. This would allow you to locate the line where this happens by loading the core file with gdb.
>
> Another possibility would be to add the attached stack-tracing class to the compiled files for Gadget4. This will activate a signal handler and - if you are moderately lucky - print an informative stack trace when the crash happens.
>
> Regards,
> Volker
>
>
>
>
> > On 25. Feb 2021, at 15:18, Tiago Castro <tiagobscastro_at_gmail.com> wrote:
> >
> > Dear list,
> >
> > I have tried to run Gadget4 on a POWER9 cluster. Right after the IC creation, during the first step, the code crashes with a segmentation fault. Any suggestions as to what I am doing wrong?
> >
> > Many thanks for any help you can provide.
> > Regards,
> > T.
> > <param.std.txt><Config.sh><slurm-2608670.out>
> > -----------------------------------------------------------
> >
> > If you wish to unsubscribe from this mailing, send mail to
> > minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
> > A web-archive of this mailing list is available here:
> > http://www.mpa-garching.mpg.de/gadget/gadget-list
>
>
Received on 2021-03-03 14:09:07

This archive was generated by hypermail 2.3.0 : 2023-01-10 10:01:32 CET