Re: Hangup when generating initial conditions from Volker Springel on 2022-02-06 (GADGET General Discussion Mailing List)

From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Date: Sun, 6 Feb 2022 11:22:56 +0100

Dear Ken,

This looks very much like another MPI instability, I'm afraid, in a call to the native version of MPI_Allgatherv() in Intel MPI. This is one of the most general and complex collective communication calls, and my experience is that many MPI libraries are not always stable for it (depending on the size of the transfer, the network stack, the phase of the moon, etc.), presumably due to their aggressive attempts to optimize execution time.

This is also why there is the switch

MPI_HYPERCUBE_ALLGATHERV

in the code, which will replace the native MPI_Allgatherv() call with my own simple hypercube algorithm based on MPI_Sendrecv(). I would suggest switching this on and trying again.
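
For illustration, the pairwise exchange pattern behind such a hypercube scheme can be sketched roughly as follows. This is a single-process toy simulation in C, not the actual GADGET-4 routine: the function names, the flat buffer layout, and the XOR pairing schedule are assumptions made for this example, and memcpy() stands in for the point where each rank would call MPI_Sendrecv() with its partner.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Smallest power of two >= ntask, i.e. the size of the enclosing hypercube. */
int hypercube_size(int ntask)
{
  int size = 1;
  while(size < ntask)
    size <<= 1;
  return size;
}

/* Partner of `rank` in exchange step `ngrp`, or -1 if the rank idles in this
 * step because its XOR partner lies outside the range of existing ranks. */
int hypercube_partner(int rank, int ngrp, int ntask)
{
  int partner = rank ^ ngrp;
  return partner < ntask ? partner : -1;
}

/* Toy single-process model of an allgatherv (hypothetical helper).
 * contrib: all contributions laid out as in the final gathered result
 * counts[r], displs[r]: length and offset of rank r's block
 * recv: ntask rows of `total` ints, one row per simulated rank */
void simulate_hypercube_allgatherv(int ntask, const int *contrib, const int *counts,
                                   const int *displs, int total, int *recv)
{
  assert(ntask > 0 && total >= 0);

  /* every rank first copies its own block into its receive buffer */
  for(int rank = 0; rank < ntask; rank++)
    memcpy(recv + (size_t)rank * total + displs[rank], contrib + displs[rank],
           (size_t)counts[rank] * sizeof(int));

  int size = hypercube_size(ntask);

  /* in step ngrp, ranks r and r ^ ngrp form a pair and swap their blocks */
  for(int ngrp = 1; ngrp < size; ngrp++)
    for(int rank = 0; rank < ntask; rank++)
      {
        int partner = hypercube_partner(rank, ngrp, ntask);
        if(partner < 0)
          continue;

        /* with real MPI, `rank` would call MPI_Sendrecv() here: send its own
         * block to `partner`, receive the partner's block at displs[partner] */
        memcpy(recv + (size_t)rank * total + displs[partner], contrib + displs[partner],
               (size_t)counts[partner] * sizeof(int));
      }
}
```

In this pairing, the partner relation is symmetric (if rank a is paired with rank b in a step, then b is paired with a), so every transfer is a plain two-sided exchange and the scheme does not depend on the library's own collective implementation.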

Best regards,
Volker

PS: The hang you experienced should not be affected by any change since Dec 23, so it is probably not reproducible in detail, which would again be consistent with a flaky implementation of MPI_Allgatherv() in the library.

> On 27. Jan 2022, at 13:38, Ken Osato <ken.osato_at_yukawa.kyoto-u.ac.jp> wrote:
>
> Dear Gadget users,
>
> Actually, I had a similar problem raised by Julianne, related to the routine shared_memory_handler(), when running gravity-only simulations with Gadget-4.
> The error seems to occur for MPICH-based libraries since I'm also using Intel MPI (v. 2020.4.304) on our cluster.
>
> Volker has already fixed this issue and I've run the simulation in order to test the code in my environment.
> First, I ran the simulation with 1024^3 particles, and the run completed without errors.
> However, when I increase the number of particles to 2048^3, it hangs while generating the initial conditions.
> This happens both for the analytic calculation (PowerSpectrumType=1) and for the tabulated input (PowerSpectrumType=2).
> I attach the log files for this run.
> When I ran the simulation with an older version of Gadget-4 (Git commit b4bb065ce3dec478d2a2d7101cefc5f5faade084, Wed Dec 23 17:05:02 2020 +0100), there was no error in generating the initial conditions.
> I think the current error might again be related to implementation differences between OpenMPI and MPICH.
>
> There is also a rather minor error at finalization. I find the following error message every time a job ends.
>> Abort(806969615) on node 185 (rank 185 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
>> PMPI_Finalize(214)...............: MPI_Finalize failed
>> PMPI_Finalize(159)...............:
>> MPID_Finalize(1288)..............:
>> MPIDI_OFI_mpi_finalize_hook(1892): OFI domain close failed (ofi_init.c:1892:MPIDI_OFI_mpi_finalize_hook:Device or resource busy)
> But the job seems to finish successfully: the log file (stdout) ends with "endrun called, calling MPI_Finalize() bye!".
> I also found no errors in the output snapshots. This is probably also due to the MPI library.
>
> Best regards,
> Ken
>
> --
> Ken Osato
> Yukawa Institute for Theoretical Physics, Kyoto University
> Kitashirakawa Oiwakecho, Sakyo-ku, Kyoto 606-8502, Japan
> Tel: +81-75-753-7000
> E-mail: ken.osato_at_yukawa.kyoto-u.ac.jp
> <slurm-151171.out><log.txt>
Received on 2022-02-06 11:22:58