Crash involving UCX when calculating tree forces in GADGET-4

From: Ouellette, Aaron James <aaronjo2_at_illinois.edu>
Date: Tue, 22 Feb 2022 18:01:59 +0000

Hello all,

I'm running a cosmological simulation with 512^3 particles that continues into the future, past redshift zero. It ran fine up until about a=4, but then began crashing at random. After each crash I was able to restart the run, and it has now finished successfully, but I would like to understand why it was crashing.

I am using OpenMPI 4.1.2 compiled with UCX 1.12.0. Each time, the run crashed in the gwalk:gravity_tree function after a call to PMPI_Recv(), followed by a large number of UCX errors that seem to stem from the error "Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)". I'm not sure how to determine whether this is an issue in my MPI setup, a hardware issue, or a bug in GADGET.

I've attached an archive containing the relevant error log, the beginning of the output log, the slurm script used to submit the job, and the Config.sh used to compile GADGET. Some additional information about the cluster: I'm running GADGET on 5 nodes (Intel Xeon Gold 6248 CPUs, 40 cores each, connected by InfiniBand), and the kernel is fairly old, at version 3.10.0-1160.49.1.el7.

Additionally, I'm not sure whether this is relevant, but when the code runs the initial MPI healthtest, I get a performance variation across the MPI ranks of 0.93, much larger than 0.5. Again, I'm not sure if this is an indication of a hardware problem or an issue in my MPI setup.
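For context, here is a minimal sketch of what a spread figure like this could represent. It assumes, purely for illustration and not taken from the GADGET-4 source, that the variation is the difference between the slowest and fastest per-rank timing of an identical test workload, divided by the mean timing. The function name and the sample timings are hypothetical.

```python
from statistics import mean


def performance_variation(timings):
    """Relative spread of per-rank timings: (slowest - fastest) / mean.

    ASSUMPTION: this definition is illustrative only; the formula that
    GADGET-4's healthtest actually uses may differ.
    """
    if not timings:
        raise ValueError("need at least one timing")
    return (max(timings) - min(timings)) / mean(timings)


if __name__ == "__main__":
    # Hypothetical wall-clock times (seconds) for the same workload on 4 ranks.
    sample = [1.00, 1.05, 0.98, 1.90]
    print(f"variation = {performance_variation(sample):.2f}")
```

Under this assumed definition, a single slow rank is enough to push the figure well above a threshold such as 0.5, so a large value on its own would not distinguish a hardware problem from a placement or MPI configuration issue.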


Please let me know if there's any other useful information I can provide, and thank you so much for any help.

Aaron Ouellette
Physics PhD student at University of Illinois Urbana-Champaign

Received on 2022-02-22 19:02:09
