Dear list members,
As a new GADGET4 user, I've run into a problem that I haven't been able to
solve while testing GADGET4 on my university's new HPC cluster across
multiple nodes managed by Slurm. I've seen a similar, recently posted issue
on this mailing list, but I can't confirm whether both issues have the same
origin.
I'm trying to run the "colliding galaxies" example using the provided
Config.sh and parameter file with OpenMPI 3.1.3. I've built GADGET4 with
gcc 8.3.0.
Usually the simulation starts running normally, but after some time
(sometimes minutes, sometimes hours) it crashes with a segmentation fault.
I also can't confirm whether the crash is consistent. Most of the time I've
run GADGET4 on 4 nodes with 8 CPUs each, and in those runs it crashed in a
similar way approximately 4-5 hours after the start.
Extra info:
I've attached two separate files, each containing the last iteration of a
simulation before it crashed. The file `log_tail.log` contains the usual
crash, which I've encountered every single time. The file `log_tail2.log`
contains a possibly useful anomaly, in which GADGET4 seems to terminate
because of a failure in its shared memory handler.
I would appreciate it very much if you could give any insight or advice on
how to eliminate this problem! If you require any further information,
please let me know.
Best Regards,
Balázs
Received on 2021-02-03 12:03:53