Re: Segmentation fault with GADGET4 on multiple nodes from Volker Springel on 2021-04-02 (GADGET General Discussion Mailing List)

From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Date: Fri, 2 Apr 2021 12:02:08 +0200

Hi Ken,

Thanks for reporting this issue. So far I haven't been able to reproduce this instability, but it is of course conceivable that it is a code issue of some kind. It could also be something related to the software environment. If the crash always occurs in shared_mem_handler.cc, this is at least a hint to possible causes.

- A tricky part of the code lies in the use of atomic locks for protecting against race conditions in updating local tree branches. If there were an error in the semantics of the code related to this (or in the way the optimizer of the compiler deals with the C++ shared memory model), this could show up in rare crashes that are not reproducible in detail. While your symptoms are of this type, the particular location of the crash in your stack trace (is it always the same location?) speaks against this. What compiler did you use? It might be worthwhile to try gcc if it was Intel, or vice versa, and to use a lower optimization level as a cross-check against compiler quirks.

- Another possibility would be that the MPI library does not allocate the shared memory fully correctly (through the MPI_Win_allocate_shared call). This feature of MPI-3 was introduced relatively recently into the standard and in my experience tends to be implemented less robustly than older MPI features. One special aspect of your run is that you have nodes with huge memory (awesome... 32 GB/core). Due to the configuration of /dev/shm, you can use only half of the physical memory as shared memory, but at least for your current setup, this is not a limitation, since the run should fit in ~2 GB/core or so. What is super strange, however, is the log-file line "MALLOC: Allocation of shared memory took 94.0149 sec". Normally, this should take fractions of a second at most. So something unusual is going on when the code executes MPI_Win_allocate_shared(), and it could indicate that the MPI library struggles with this for some reason. You are currently allocating ~13 GB per MPI rank, much more than you need for your run (check memory.txt for your actual needs). It would be interesting to see what happens if you drop this number, and whether or not this makes the instability go away. Also, I would recommend trying another MPI library, say OpenMPI-4. If the instability is absent with this, it would indeed point to the MPI library as the root cause. Currently, I consider this the most likely explanation among several vague ones, see above.
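
The /dev/shm point above can be checked on a compute node with a small C++ sketch like the following. It is not part of GADGET-4. The helper names available_bytes and request_fits are made up for illustration, and the sketch assumes that the MPI library backs MPI_Win_allocate_shared with /dev/shm, which depends on the MPI implementation.

```cpp
#include <sys/statvfs.h>  // POSIX statvfs()
#include <cassert>
#include <cstdint>

// Hypothetical helper: stores in `bytes` the space currently available to
// unprivileged users on the file system containing `path`.
// Returns false if statvfs() fails (for example, if the path does not exist).
bool available_bytes(const char* path, std::uint64_t& bytes) {
    struct statvfs s;
    if (statvfs(path, &s) != 0) {
        return false;
    }
    bytes = static_cast<std::uint64_t>(s.f_bavail) *
            static_cast<std::uint64_t>(s.f_frsize);
    return true;
}

// Hypothetical helper: true if `ranks_per_node` MPI ranks, each asking for
// `mib_per_rank` MiB of shared memory, fit into `avail_bytes`.
bool request_fits(std::uint64_t avail_bytes,
                  std::uint64_t ranks_per_node,
                  std::uint64_t mib_per_rank) {
    assert(ranks_per_node > 0);  // a node without ranks makes no sense here
    const std::uint64_t need = ranks_per_node * mib_per_rank * 1024ULL * 1024ULL;
    return need <= avail_bytes;
}
```

For example, calling available_bytes("/dev/shm", bytes) on one node and then request_fits(bytes, ranks_per_node, 13 * 1024) would test the ~13 GB per rank mentioned above. The number of ranks per node is not given in this thread and has to be supplied by the user.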

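To illustrate the first point above (atomic locks guarding updates of a tree branch), here is a minimal C++ sketch of the general technique. It is not GADGET-4's actual implementation: SpinLock, Branch and add_to_branch are hypothetical names, and Branch is only a stand-in for a tree node. The acquire/release memory orders are what the compiler and optimizer must respect for the lock to be correct.

```cpp
#include <atomic>
#include <cassert>

// Minimal spin lock built on std::atomic<bool> (C++11).
class SpinLock {
public:
    // Tries once to take the lock; returns true on success.
    // Acquire ordering: reads after a successful lock see earlier writes
    // made by the previous holder before its unlock().
    bool try_lock() {
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    // Spins until the lock is obtained.
    void lock() {
        while (!try_lock()) {
            // busy-wait
        }
    }

    // Release ordering: writes made while holding the lock become visible
    // to the next holder.
    void unlock() {
        assert(locked_.load(std::memory_order_relaxed));  // must be held
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};

// Hypothetical stand-in for a tree branch that several workers update.
struct Branch {
    SpinLock guard;
    long count = 0;
};

// Updates the branch while holding its lock, so concurrent updates
// cannot interleave inside the critical section.
void add_to_branch(Branch& b, long n) {
    b.guard.lock();
    b.count += n;
    b.guard.unlock();
}
```

If the orderings in such a lock were wrong, or were mishandled at high optimization levels, the result would be exactly the kind of rare, non-reproducible corruption described above, which is why cross-checking with another compiler or a lower optimization level is informative.
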
Best,
Volker

> On 30. Mar 2021, at 00:08, Ken Osato <ken.osato_at_iap.fr> wrote:
>
> Dear Gadget list members,
>
> When I was running Gadget-4 with a gravity-only setup, I encountered an error similar to the one mentioned by Balázs recently. This type of error occurs only for simulations with a large number of particles. As far as I have tested, the runs with N = 1024^3 and 2048^3 failed due to the segmentation fault.
> Balázs reported "it crashed similarly after approximately 4-5 hours after start." but I guess this segmentation fault occurs after a certain number of time steps when the code runs for a long time. I restarted the run from the latest restart file and it ran again for a while but crashed at a different point. (Thus, by taking a short restart dumping interval and restarting many times, I managed to run the simulation through to the end...)
>
> I've attached the first and last parts of the log file to show input parameters and configurations, and the error message from stderr, which indicates that there is something wrong in "shared_mem_handler.cc". The input parameters are almost identical to those in the "DM-L50-N128" example except for the number of particles and the box size.
> I've run Gadget-4 with 1344 cores and the Intel MPI library (version: 2019 Update 9), and with some debug options: "PRESERVE_SHMEM_BINARY_INVARIANCE" and "DEBUG", but for some reason, no core file was created... I'd appreciate any help very much.
>
> Best,
> Ken Osato
>
>
> On 03/02/2021 12:03, Balázs Pál wrote:
>> Dear list members,
>>
>> As a new GADGET4 user, I've encountered a yet unsolved problem, while testing GADGET4 at my university's new HPC cluster on multiple nodes, controlled by Slurm. I've seen a similar (newly posted) issue on this mailing list, but I can't confirm whether both issues have the same origin.
>> I'm trying to run the "colliding galaxies" example using the provided Config.sh and parameter file with OpenMPI 3.1.3. I've built G4 with gcc 8.3.0.
>>
>> Usually what happens is that the simulation starts running normally, but after some time (sometimes minutes, sometimes only after hours) it crashes with a segmentation fault. I also can't confirm whether this crash is consistent or not. I've tried to run GADGET4 on 4 nodes with 8 CPUs each most of the time, and it crashed similarly after approximately 4-5 hours after start.
>>
>> Extra info:
>> I've attached two separate files, containing the last iteration of two simulations before a crash. The file `log_tail.log` contains the usual crash, which I've encountered every single time. The `log_tail2.log` contains a "maybe useful anomaly", when GADGET4 seems to terminate because of some failure in its shared memory handler.
>>
>> I would appreciate it very much if you could give any insight or advice on how to eliminate this problem! If you require any further information, please let me know.
>>
>> Best Regards,
>> Balázs
> <stderr.log><log.txt>
Received on 2021-04-02 12:02:09