Re: Segmentation fault with GADGET4 on multiple nodes

From: Ken Osato <ken.osato_at_iap.fr>
Date: Tue, 30 Mar 2021 00:08:00 +0200

Dear Gadget list members,

While running Gadget-4 in a gravity-only setup, I encountered an error
similar to the one Balázs reported recently. This type of error occurs
only for simulations with a large number of particles. As far as I have
tested, the runs with N = 1024^3 and 2048^3 failed with a segmentation
fault.
Balázs reported that "it crashed similarly after approximately 4-5 hours
after start." but I suspect the segmentation fault occurs after a
certain number of time steps, once the code has been running for a long
time. When I restarted the run from the latest restart file, it ran
again for a while but crashed at a different point. (Thus, by setting a
short restart dump interval and restarting many times, I managed to run
the simulation through to the end...)
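
The restart workaround described above could be automated roughly as in
the following minimal Python sketch. It is only an illustration under
stated assumptions: the launcher, binary path and parameter file name in
the comments are placeholders rather than my exact setup, the restart
command is assumed to resume from the latest restart files (restart
flag 1), and the helper name `run_with_restarts` is hypothetical.

```python
import subprocess


def run_with_restarts(first_cmd, restart_cmd, max_restarts=50,
                      runner=subprocess.run):
    """Run first_cmd once; if it exits with a non-zero code (for example
    after a segmentation fault), run restart_cmd repeatedly until one
    attempt exits cleanly or max_restarts is reached.

    Returns the number of restarts that were needed.
    """
    cmd = first_cmd
    for attempt in range(max_restarts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        # Every later attempt resumes from the most recent restart dump.
        cmd = restart_cmd
    raise RuntimeError("run did not finish within %d restarts" % max_restarts)


# Hypothetical usage (placeholder command line, adapt to the local setup):
#   base = ["mpirun", "-np", "1344", "./Gadget4", "param.txt"]
#   run_with_restarts(base, base + ["1"])
```

Each restart only loses the work done since the last restart dump, which
is why a short dump interval keeps the cost of this workaround small. It
does not address the cause of the segmentation fault itself.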

I've attached the first and last parts of the log file to show the input
parameters and configuration, together with the error message from
stderr, which indicates that something is wrong in
"shared_mem_handler.cc". The input parameters are almost identical to
those in the "DM-L50-N128" example, except for the number of particles
and the box size.
I ran Gadget-4 on 1344 cores with the Intel MPI library (version 2019
Update 9) and with some debug options,
"PRESERVE_SHMEM_BINARY_INVARIANCE" and "DEBUG", but for some reason no
core file was created... I'd appreciate any help very much.

Best,
Ken Osato


On 03/02/2021 12:03, Balázs Pál wrote:
> Dear list members,
>
> As a new GADGET4 user, I've encountered a yet unsolved problem, while
> testing GADGET4 at my university's new HPC cluster on multiple nodes,
> controlled by Slurm. I've seen a similar (newly posted) issue on this
> mailing list, but I can't confirm whether both issues have the same
> origin.
> I'm trying to run the "colliding galaxies" example using the provided
> Config.sh and parameter file with OpenMPI 3.1.3. I've built G4 with
> gcc 8.3.0.
>
> Usually what happens is that the simulation starts running normally,
> but after some time (sometimes minutes, sometimes only after hours) it
> crashes with a segmentation fault. I also can't confirm whether this
> crash is consistent or not. I've tried to run GADGET4 on 4 nodes with
> 8 CPUs each most of the time, and it crashed similarly after
> approximately 4-5 hours after start.
>
> Extra info:
> I've attached two separate files, containing the last iteration of two
> simulations before a crash. The file `log_tail.log` contains the usual
> crash, which I've encountered every single time. The `log_tail2.log`
> contains a "maybe useful anomaly", when GADGET4 seems to terminate
> because of some failure in its shared memory handler.
>
> I would appreciate it very much if you could give any insight or
> advice on how to eliminate this problem! If you require any further
> information, please let me know.
>
> Best Regards,
> Balázs



Received on 2021-03-30 00:08:12
