Re: Segmentation fault with GADGET4 on multiple nodes

From: Ken Osato <ken.osato_at_yukawa.kyoto-u.ac.jp>
Date: Sun, 4 Apr 2021 01:03:05 +0200

Dear Volker,

Thank you for your reply and suggestions.
> What compiler did you use? It might be worthwhile to try gcc if it was
> intel - or vice versa, and to use a lower optimization level as a
> cross check against compiler quirks.
The compiler I used was mpiicpc, so I'll try with gcc. So far, I have
reproduced the error at every optimization level (from -fast to -O0).

> Another possibility would be that the MPI library does not allocate
> the shared memory fully correctly (through the MPI_Win_allocate_shared
> call). This feature of MPI-3 was introduced relatively recently into
> the standard and in my experience tends to be implemented less
> robustly than older MPI features.
I've just asked the admin of my cluster to change the shared-memory
configuration, and the full physical memory is now available as shared
memory. I'll rerun the test to see how it works.
Also, I found out that Intel MPI implements MPI-3.1, so the error may
indeed be caused by the implementation of this relatively new MPI-3
feature, as you pointed out.
I have actually run with another MPI library (an OpenMPI-based one,
optimized for GPU rather than CPU and therefore quite slow), and it
indeed ran without any error.
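
As a further check independent of Gadget-4, I plan to time
MPI_Win_allocate_shared in isolation with a small standalone program
along the following lines. This is only an illustrative sketch (the
file name and the 1 GB per-rank allocation size are arbitrary choices):

  // shm_test.cc -- minimal timing test of MPI_Win_allocate_shared
  // build e.g. with:  mpicxx -O2 shm_test.cc -o shm_test
  #include <mpi.h>
  #include <cstdio>
  #include <cstring>

  int main(int argc, char **argv)
  {
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* communicator containing only the ranks on this node */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    /* allocate 1 GB of shared memory per rank and time the call */
    MPI_Aint size = 1024L * 1024L * 1024L;
    void *baseptr = nullptr;
    MPI_Win win;

    double t0 = MPI_Wtime();
    MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, node_comm,
                            &baseptr, &win);
    double t1 = MPI_Wtime();

    /* touch the whole segment to make sure the pages really get mapped */
    memset(baseptr, 0, size);

    printf("rank %d: allocated %ld bytes of shared memory in %.3f sec\n",
           world_rank, (long) size, t1 - t0);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
  }

If the long allocation times (or crashes) already show up in such a
minimal test, that would clearly point to the MPI library or the node
configuration rather than to Gadget-4 itself.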

In any case, I think the most practical solution is to switch to
another, more robust MPI library (e.g. OpenMPI-4). I'll keep you posted
on how things go.
Thank you again for your great help.

Best,
Ken


On 02/04/2021 12:02, Volker Springel wrote:
> Hi Ken,
>
> Thanks for reporting this issue. So far I haven't been able to
> reproduce this instability, but it is of course conceivable that it
> is a code issue of some kind. But it could also be something related
> to the software environment. If the crash always occurs in
> shared_mem_handler.cc, this is at least a hint to possible causes.
>
> - A tricky part of the code lies in the use of atomic locks for
> protecting against race conditions in updating local tree branches. If
> there was an error in the semantics of the code related to this (or in
> the way the optimizer of the compiler deals with the C++ shared memory
> model), this could show up in rare crashes that are not reproducible in
> detail. While your symptoms are of this type, the particular location
> of the crash in your stack trace (is it always the same location?)
> speaks against this. What compiler did you use? It might be worthwhile
> to try gcc if it was intel - or vice versa, and to use a lower
> optimization level as a cross check against compiler quirks.
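>
> To illustrate what is meant here: the pattern in question is a simple
> atomic spin-lock around updates of shared tree nodes. The following is
> only a generic sketch with a made-up node_lock type, not the actual
> Gadget-4 code, but it shows the kind of construct that the compiler's
> optimizer and the C++ memory model have to get right:
>
>   #include <atomic>
>
>   struct node_lock
>   {
>     std::atomic<int> flag{0};
>
>     void lock(void)
>     {
>       int expected = 0;
>       /* spin until we atomically change 0 -> 1 */
>       while(!flag.compare_exchange_weak(expected, 1,
>                                         std::memory_order_acquire))
>         expected = 0;
>     }
>
>     void unlock(void)
>     {
>       flag.store(0, std::memory_order_release);
>     }
>   };
>
>   /* usage: acquire before modifying a shared tree branch, release after: */
>   /*   node_lock l;  l.lock();  ...update node...  l.unlock();            */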
>
> - Another possibility would be that the MPI library does not allocate
> the shared memory fully correctly (through the MPI_Win_allocate_shared
> call). This feature of MPI-3 was introduced relatively recently into
> the standard and in my experience tends to be implemented less
> robustly than older MPI features. One special aspect of your run is
> that you have nodes with huge memory (awesome... 32GB/core). Due to a
> misconfiguration of /dev/shm, you can use only half of the physical
> memory as shared memory, but at least for your current setup, this is
> not a limitation, since it should be able to run with ~2GB/core or so.
> What is super strange, however, is the log-file line "MALLOC:
> Allocation of shared memory took 94.0149 sec". Normally, this should
> take fractions of a second at most. So something unusual is going on
> when the code executes MPI_Win_allocate_shared(), and it could
> indicate that the MPI library struggles with this for some reason. You
> are currently allocating ~13 GB per MPI rank, much more than you need
> for your run (check memory.txt for your actual needs). It would be
> interesting to see what happens if you drop this number, and whether
> or not this makes the instability go away. Also, I would recommend
> trying another MPI library, say OpenMPI-4. If the instability is
> absent with this, it would indeed point to the MPI library as the root
> cause. Currently, I consider this the most likely explanation among
> several vague ones, see above.
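>
> (As a concrete illustration: assuming the per-rank allocation in your
> run is set through the usual MaxMemSize entry in the parameter file -
> an assumption on my side about how your run is configured - dropping
> it to a value somewhat above the peak usage reported in memory.txt
> would look like
>
>   MaxMemSize   2500     % MByte per MPI rank, instead of ~13000
>
> with the exact number of course depending on your actual memory needs.)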
>
> Best,
> Volker
>
>
>> On 30. Mar 2021, at 00:08, Ken Osato <ken.osato_at_iap.fr> wrote:
>>
>> Dear Gadget list members,
>>
>> When I was running Gadget-4 with a gravity-only setup, I encountered
>> an error similar to the one mentioned by Balázs recently. This type of
>> error occurs only for simulations with a large number of particles. As
>> far as I have tested, the runs with N = 1024^3 and 2048^3 failed due
>> to the segmentation fault.
>> Balázs reported that "it crashed similarly after approximately 4-5
>> hours after start.", but I guess this segmentation fault occurs after
>> a certain number of time steps when the code runs for a long time. I
>> restarted the run from the latest restart file and it ran again for a
>> while, but crashed at a different point. (Thus, by using a short
>> restart-dump interval and restarting many times, I managed to run the
>> simulation through to the end...)
>>
>> I've attached the first and last parts of the log file to show the
>> input parameters and configuration, and the error message from
>> stderr, which indicates that there is something wrong in
>> "shared_mem_handler.cc". The input parameters are almost the same as
>> those in the "DM-L50-N128" example except for the number of particles
>> and the box size.
>> I've run Gadget-4 with 1344 cores and the Intel MPI library (version:
>> 2019 Update 9), and with the debug options
>> "PRESERVE_SHMEM_BINARY_INVARIANCE" and "DEBUG", but for some reason
>> no core file has been created... I'd appreciate any help very much.
>>
>> Best,
>> Ken Osato
>>
>>
>> On 03/02/2021 12:03, Balázs Pál wrote:
>>> Dear list members,
>>>
>>> As a new GADGET4 user, I've encountered an as-yet unsolved problem
>>> while testing GADGET4 on my university's new HPC cluster on multiple
>>> nodes, controlled by Slurm. I've seen a similar (newly posted) issue
>>> on this mailing list, but I can't confirm whether both issues have
>>> the same origin.
>>> I'm trying to run the "colliding galaxies" example using the
>>> provided Config.sh and parameter file with OpenMPI 3.1.3. I've built
>>> G4 with gcc 8.3.0.
>>>
>>> Usually what happens is that the simulation starts running normally,
>>> but after some time (sometimes minutes, sometimes only after hours)
>>> it crashes with a segmentation fault. I also can't confirm whether
>>> this crash is consistent or not. Most of the time I've run GADGET4
>>> on 4 nodes with 8 CPUs each, and it crashed similarly approximately
>>> 4-5 hours after the start.
>>>
>>> Extra info:
>>> I've attached two separate files, containing the last iteration of
>>> two simulations before a crash. The file `log_tail.log` contains the
>>> usual crash, which I've encountered every single time. The
>>> `log_tail2.log` contains a "maybe useful anomaly", where GADGET4
>>> seems to terminate because of some failure in its shared memory
>>> handler.
>>>
>>> I would appreciate it very much if you could give any insight or
>>> advice on how to eliminate this problem! If you require any further
>>> information, please let me know.
>>>
>>> Best Regards,
>>> Balázs
>> <stderr.log><log.txt>

-- 
Ken Osato
Yukawa Institute for Theoretical Physics, Kyoto University
Kitashirakawa Oiwakecho, Sakyo-ku, Kyoto 606-8502, Japan
Tel: +81-75-753-7000
E-mail: ken.osato_at_yukawa.kyoto-u.ac.jp
Received on 2021-04-04 01:03:24
