Re: Crash involving UCX when calculating tree forces in GADGET-4

From: Ouellette, Aaron James <aaronjo2_at_illinois.edu>
Date: Tue, 22 Feb 2022 21:00:45 +0000

Hi Volker,

Thank you so much for the suggestion. I'll test it on my future runs.

Thanks,

Aaron
________________________________
From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Sent: Tuesday, February 22, 2022 12:41 PM
To: Gadget General Discussion <gadget-list_at_MPA-Garching.MPG.DE>
Subject: Re: [gadget-list] Crash involving UCX when calculating tree forces in GADGET-4


Hi Aaron,

Thanks for reporting this issue. It looks quite similar to an MPI instability that I encountered on another cluster recently. In that case, it went away after adding the settings

export UCX_TLS=self,sm,ud
export UCX_UD_MLX5_RX_QUEUE_LEN=16384

in the job script, before Gadget4 is launched. Perhaps worth a try.
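
For concreteness, here is a minimal sketch of where these lines could sit in a Slurm job script (the node counts, paths, and launch command are just placeholders; adjust them to your own setup):

#!/bin/bash
#SBATCH --nodes=5                # placeholder allocation
#SBATCH --ntasks-per-node=40     # placeholder

# Restrict UCX to the self, shared-memory, and unreliable-datagram transports,
# and enlarge the UD receive queue, before Gadget4 is launched.
export UCX_TLS=self,sm,ud
export UCX_UD_MLX5_RX_QUEUE_LEN=16384

mpirun ./Gadget4 param.txt       # placeholder launch line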

Best regards,
Volker


> On 22. Feb 2022, at 19:01, Ouellette, Aaron James <aaronjo2_at_illinois.edu> wrote:
>
> Hello all,
>
> I'm trying to run a cosmological simulation with 512^3 particles that continues into the future, past redshift zero. It ran fine up until about a=4, but then began to crash at random. After each crash I was able to restart the run, and it has now finished successfully, but I would like to understand why it was crashing.
>
> I am using OpenMPI 4.1.2 compiled with UCX 1.12.0. It looks like each time, the run crashed in the gwalk:gravity_tree function after a call to PMPI_Recv(), followed by a large number of errors from UCX that seem to stem from "Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)". I'm not sure how to determine whether this is an issue with my MPI setup, a hardware problem, or a bug in GADGET.
>
> I've attached an archive containing the relevant error log, the beginning of the output log, the Slurm script used to submit the job, and the Config.sh used to compile GADGET. Some additional information about the cluster: I'm running GADGET on 5 nodes (Intel Xeon Gold 6248 CPUs with 40 cores each, connected by InfiniBand), and the kernel is fairly old, version 3.10.0-1160.49.1.el7.
>
> Additionally, I'm not sure whether it is relevant, but when the code runs the initial MPI healthtest, I get a performance variation across the MPI ranks of 0.93, much larger than 0.5. Again, I'm not sure whether this indicates a hardware problem or an issue with my MPI setup.
>
>
> Please let me know if there's any other useful information I can provide and thank you so much for any help.
>
> Aaron Ouellette
>
>
> Physics PhD student at University of Illinois Urbana-Champaign
>
> <logs.tar.xz>




-----------------------------------------------------------

If you wish to unsubscribe from this mailing, send mail to
minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
A web-archive of this mailing list is available here:
http://www.mpa-garching.mpg.de/gadget/gadget-list
Received on 2022-02-22 22:01:03

This archive was generated by hypermail 2.3.0 : 2023-01-10 10:01:33 CET