Re: Hangup when generating initial conditions

From: Ken Osato <ken.osato_at_yukawa.kyoto-u.ac.jp>
Date: Mon, 7 Feb 2022 21:03:55 +0900

Dear Volker,

Thank you for your help. I tried running with "MPI_HYPERCUBE_ALLGATHERV"
enabled, but this time the run failed in the Ewald table module
(ewald::ewald_init()). I've included the error log for this run below.
I also tried switching on either "USE_MPIALLTOALLV_IN_DOMAINDECOMP" or
"ISEND_IRECV_IN_DOMAIN", but both runs failed with similar errors.
In all of the runs above, I reduced the MPI message size limit to
MPI_MESSAGE_SIZELIMIT_IN_MB = 100 to avoid large communications.

Best regards,
Ken

> ==== backtrace (tid: 459742) ====
>  0 0x0000000000012b20 .annobin_sigaction.c()  sigaction.c:0
>  1 0x000000000085d219 I_MPI_memcpy_movsb()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/i_mpi_memcpy_sse.h:11
>  2 0x000000000085d219 bdw_memcpy_write()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_memcpy.h:162
>  3 0x000000000085c554 write_to_frame()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_memcpy.h:478
>  4 0x000000000085c554 send_frame()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h:1212
>  5 0x0000000000853833 MPIDI_POSIX_eager_send()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h:1543
>  6 0x0000000000755399 MPIDI_POSIX_eager_send()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/posix_eager_impl.h:37
>  7 0x0000000000755399 MPIDI_POSIX_am_isend()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_am.h:220
>  8 0x0000000000755399 MPIDI_SHM_am_isend()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_am.h:49
>  9 0x0000000000755399 MPIDIG_isend_impl()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/generic/mpidig_send.h:116
> 10 0x000000000075870e MPIDIG_am_isend()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/generic/mpidig_send.h:172
> 11 0x000000000075870e MPIDIG_mpi_isend()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/generic/mpidig_send.h:233
> 12 0x000000000075870e MPIDI_POSIX_mpi_isend()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_send.h:59
> 13 0x000000000075870e MPIDI_SHM_mpi_isend()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_p2p.h:187
> 14 0x000000000075870e MPIDI_isend_unsafe()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:314
> 15 0x000000000075870e MPIDI_isend_safe()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:609
> 16 0x000000000075870e MPID_Isend()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:828
> 17 0x000000000075870e PMPI_Sendrecv()
> /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/pt2pt/sendrecv.c:181
> 18 0x000000000044fa28 MPI_hypercube_Allgatherv()
> /home/uchu/ken.osato/Gadget-4/src/mpi_utils/hypercube_allgatherv.cc:47
> 19 0x000000000047d12f ewald::ewald_init()
> /home/uchu/ken.osato/Gadget-4/src/gravity/ewald.cc:208
> 20 0x0000000000405347 sim::begrun1()
> /home/uchu/ken.osato/Gadget-4/src/main/begrun.cc:222
> 21 0x000000000040f0f3 main()
> /home/uchu/ken.osato/Gadget-4/src/main/main.cc:220
> 22 0x0000000000023493 __libc_start_main()  ???:0
> 23 0x0000000000404cae _start()  ???:0
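
Frames 17 and 18 of the backtrace above are inside the hypercube replacement for MPI_Allgatherv(), which Volker's quoted message below describes as a simple algorithm based on MPI_Sendrecv(). As a rough illustration of that kind of exchange pattern, here is a minimal single-process C++ sketch. It is not the Gadget-4 implementation: the function name hypercube_allgatherv_sim and the data layout are hypothetical, and it assumes the common scheme in which rank r exchanges blocks with partner (r XOR ngrp) in round ngrp. The line marked as standing in for MPI_Sendrecv() is where a real MPI code would communicate.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-process simulation of a hypercube-style allgatherv.
// contrib[r] is the variable-length block contributed by rank r.
// Returns result[r][s] = the block that rank r holds from rank s afterwards.
std::vector<std::vector<std::vector<int>>> hypercube_allgatherv_sim(
    const std::vector<std::vector<int>> &contrib)
{
  const int ntask = static_cast<int>(contrib.size());

  // Smallest power of two that is >= ntask (size of the enclosing hypercube).
  int npow2 = 1;
  while(npow2 < ntask)
    npow2 <<= 1;

  std::vector<std::vector<std::vector<int>>> result(
      ntask, std::vector<std::vector<int>>(ntask));

  // Every rank already owns its own block.
  for(int rank = 0; rank < ntask; rank++)
    result[rank][rank] = contrib[rank];

  // In round ngrp, rank r is paired with r XOR ngrp. The pairing is symmetric,
  // so both partners send and receive in the same round.
  for(int ngrp = 1; ngrp < npow2; ngrp++)
    for(int rank = 0; rank < ntask; rank++)
      {
        const int partner = rank ^ ngrp;

        // Partners beyond ntask do not exist when ntask is not a power of two.
        if(partner < ntask)
          result[rank][partner] = contrib[partner];  // stands in for one MPI_Sendrecv()
      }

  return result;
}
```

Because any two distinct ranks r and s are paired in round ngrp = r XOR s, every rank ends up holding every block, including when the number of ranks is not a power of two. Each transfer is a plain pairwise exchange, so no native collective call is involved.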


On 06/02/2022 19:22, Volker Springel wrote:
> Dear Ken,
>
> This looks very much like another MPI instability, I'm afraid, in a call of the native version of MPI_Allgatherv() of Intel MPI. This is one of the most general and complex collective communication calls... my experience is that many MPI libraries are not always stable for it (depending on the size of the transfer, the network stack, the phase of the moon, etc.), presumably due to their aggressive attempts to optimize execution time.
>
> This is also why there is the switch
>
> MPI_HYPERCUBE_ALLGATHERV
>
> in the code, which will replace the native MPI_Allgatherv() call with my own simple hypercube algorithm based on MPI_Sendrecv(). I would suggest switching this on and trying again.
>
> Best regards,
> Volker
>
> PS: The hang you experienced should not be affected by any change since Dec 23, so it is probably not reproducible in detail, which would again be consistent with a flaky implementation of MPI_Allgatherv() in the library.
>
>> On 27. Jan 2022, at 13:38, Ken Osato <ken.osato_at_yukawa.kyoto-u.ac.jp> wrote:
>>
>> Dear Gadget users,
>>
>> Actually, I had a problem similar to the one raised by Julianne, related to the routine shared_memory_handler(), when running gravity-only simulations with Gadget-4.
>> The error seems to occur for MPICH-based libraries since I'm also using Intel MPI (v. 2020.4.304) on our cluster.
>>
>> Volker has already fixed this issue and I've run the simulation in order to test the code in my environment.
>> First, I ran the simulation with 1024^3 particles, and the run completed without errors.
>> However, when I increase the number of particles to 2048^3, it hangs while generating the initial conditions.
>> This error occurs both for the analytic calculation (PowerSpectrumType=1) and when loading a table (PowerSpectrumType=2).
>> I attach the log files for this run.
>> When I ran the simulation with an older version of Gadget-4 (Git commit b4bb065ce3dec478d2a2d7101cefc5f5faade084, Wed Dec 23 17:05:02 2020 +0100), there was no error in the initial conditions.
>> I think the current error might again be related to differences between the OpenMPI and MPICH implementations.
>>
>> There is also a minor error at finalization: I find the following error message every time the job ends.
>>> Abort(806969615) on node 185 (rank 185 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
>>> PMPI_Finalize(214)...............: MPI_Finalize failed
>>> PMPI_Finalize(159)...............:
>>> MPID_Finalize(1288)..............:
>>> MPIDI_OFI_mpi_finalize_hook(1892): OFI domain close failed (ofi_init.c:1892:MPIDI_OFI_mpi_finalize_hook:Device or resource busy)
>> But the job seems to finish successfully: the log file (stdout) ends with "endrun called, calling MPI_Finalize() bye!".
>> I also found no errors in the output snapshots. This is probably also due to the MPI library.
>>
>> Best regards,
>> Ken
>>
>> --
>> Ken Osato
>> Yukawa Institute for Theoretical Physics, Kyoto University
>> Kitashirakawa Oiwakecho, Sakyo-ku, Kyoto 606-8502, Japan
>> Tel: +81-75-753-7000
>> E-mail: ken.osato_at_yukawa.kyoto-u.ac.jp
>> <slurm-151171.out><log.txt>
>> -----------------------------------------------------------
>>
>> If you wish to unsubscribe from this mailing, send mail to
>> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
>> A web-archive of this mailing list is available here:
>> http://www.mpa-garching.mpg.de/gadget/gadget-list

-- 
Ken Osato
Yukawa Institute for Theoretical Physics, Kyoto University
Kitashirakawa Oiwakecho, Sakyo-ku, Kyoto 606-8502, Japan
Tel: +81-75-753-7000
E-mail: ken.osato_at_yukawa.kyoto-u.ac.jp
Received on 2022-02-07 13:04:18
