Re: MPI_Sendrecv error in Gadget2 from Volker Springel on 2006-10-09 (GADGET General Discussion Mailing List)

From: Volker Springel <volker_at_MPA-Garching.MPG.DE>
Date: Mon, 09 Oct 2006 15:00:02 +0200

Michele Trenti wrote:
> Hi,
>
> I am stuck with an MPI error in MPI_Sendrecv when I try to run "large" N
> (400^3 on 18 cpus) simulations with Gadget2. The error (reported in the
> log below) appears after the Tree force evaluation in the first step.
> Smaller N runs (e.g. 256^3) complete smoothly in a few hours.
>
> My understanding (see the error log below) is that if the CPU time to
> compute the tree force is too long (i.e. of the order of a few minutes),
> the connection between the nodes is killed. However, the MPI ring stays
> up, i.e. I can run mpitrace and mpiringtest without trouble after I get
> the error message from the Gadget run. Also, I have never been kicked
> out of an open ssh session on a node, even after staying idle
> for days.
>
> I am using a linux cluster of 9 Sun Opteron Dual CPU hosts with
> MPICH2-1.0.4.
>
> Has anyone encountered a similar problem and/or can share some insight on
> possible solutions?

Hi Michele,

This is a peculiar problem, I agree. I'm not sure that the 'time-out'
explanation is the right one though. Normally, MPI should keep waiting.

From your log file, it looks as if the process on Cpu 11 (rank 11 in the
log) was brought down by a KILL signal (signal 9). This can happen when
a machine runs out of memory and the operating system kills the process.
You could monitor the memory usage of the code with a tool like 'top' to
see how close you are to exhausting the physical memory of the machine(s).
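
As an alternative to watching 'top' by hand, the resident memory of a
running task can be read from the Linux /proc filesystem. The following is
only a minimal sketch, assuming a Linux node where /proc/<pid>/status
contains a "VmRSS" line; the function names are made up for illustration
and are not part of Gadget2 or MPICH2.

```python
import re


def parse_vmrss_kb(status_text):
    """Return the VmRSS value (resident memory, in kB) found in the text
    of /proc/<pid>/status, or None if no such line is present."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid="self"):
    """Read the resident memory of a process from /proc (Linux only).

    pid may be a process id or the string "self" for the current process.
    """
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vmrss_kb(status_file.read())
```

Calling read_rss_kb(pid) periodically for each Gadget2 process on a node,
and comparing the sum with the node's physical memory, would show whether
memory use climbs towards the limit shortly before the job is killed.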

Volker

>
> Thanks a lot for your help,
>
> Michele
>
> -------------------------------------------------------------------
> This is Gadget, version `2.0'.
>
> Running on 18 processors.
>
> found 14 times in output-list.
>
> Allocated 40 MByte communication buffer per processor.
>
> Communication buffer has room for 953250 particles in gravity computation
> Communication buffer has room for 327680 particles in density computation
> Communication buffer has room for 262144 particles in hydro computation
> Communication buffer has room for 243854 particles in domain decomposition
>
>
> Hubble (internal units) = 0.1
> G (internal units) = 43007.1
> UnitMass_in_g = 1.989e+43
> UnitTime_in_s = 3.08568e+16
> UnitVelocity_in_cm_per_s = 100000
> UnitDensity_in_cgs = 6.76991e-22
> UnitEnergy_in_cgs = 1.989e+53
>
> Task=0 FFT-Slabs=15
> Task=1 FFT-Slabs=15
> Task=2 FFT-Slabs=15
> Task=3 FFT-Slabs=15
> Task=4 FFT-Slabs=15
> Task=5 FFT-Slabs=15
> Task=6 FFT-Slabs=15
> Task=7 FFT-Slabs=15
> Task=8 FFT-Slabs=15
> Task=9 FFT-Slabs=15
> Task=10 FFT-Slabs=15
> Task=11 FFT-Slabs=15
> Task=12 FFT-Slabs=15
> Task=13 FFT-Slabs=15
> Task=14 FFT-Slabs=15
> Task=15 FFT-Slabs=15
> Task=16 FFT-Slabs=15
> Task=17 FFT-Slabs=1
>
> Allocated 434.028 MByte for particle storage. 80
>
>
> reading file `./ic/GadgetSnapshot_000' on task=0 (contains 64000000 particles.)
> distributing this file to tasks 0-17
> Type 0 (gas): 0 (tot= 0000000000) masstab=0
> Type 1 (halo): 64000000 (tot= 0064000000) masstab=0.057824
> Type 2 (disk): 0 (tot= 0000000000) masstab=0
> Type 3 (bulge): 0 (tot= 0000000000) masstab=0
> Type 4 (stars): 0 (tot= 0000000000) masstab=0
> Type 5 (bndry): 0 (tot= 0000000000) masstab=0
>
> reading done.
> Total number of particles : 0064000000
>
> allocated 0.0762939 Mbyte for ngb search.
>
> Allocated 321.943 MByte for BH-tree. 64
>
> domain decomposition...
> NTopleaves= 512
> work-load balance=1.02083 memory-balance=1.02083
> exchange of 0060408642 particles
> exchange of 0026642403 particles
> exchange of 0005557184 particles
> exchange of 0000825654 particles
> domain decomposition done.
> begin Peano-Hilbert order...
> Peano-Hilbert done.
> Begin Ngb-tree construction.
> Ngb-Tree contruction finished
>
> Setting next time for snapshot file to Time_next= 0.0243902
>
>
> Begin Step 0, Time: 0.0123457, Redshift: 80, Systemstep: 0, Dloga: 0
> domain decomposition...
> NTopleaves= 512
> work-load balance=1.02083 memory-balance=1.02083
> domain decomposition done.
> begin Peano-Hilbert order...
> Peano-Hilbert done.
> Start force computation...
> Starting periodic PM calculation.
>
> Allocated 17.6798 MByte for FFT data.
>
> done PM.
> Tree construction.
> Tree construction done.
> Begin tree force.
> tree is done.
> Begin tree force.
> [cli_16]: aborting job:
> Fatal error in MPI_Sendrecv: Other MPI error, error stack:
> MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x2a97d77f08, scount=488032, MPI_BYTE, dest=11, stag=18, rbuf=0x2a98d9b2e8, rcount=623888, MPI_BYTE, src=11, rtag=18, MPI_COMM_WORLD, status=0x7fbfffee00) failed
> MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(608):
> MPIDU_Socki_handle_pollhup(439)...........: connection closed by peer (set=0,sock=16)
> [cli_11]: aborting job:
> Fatal error in MPI_Sendrecv: Other MPI error, error stack:
> MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x2a98513668, scount=623888, MPI_BYTE, dest=16, stag=18, rbuf=0x2a990016b8, rcount=488032, MPI_BYTE, src=16, rtag=18, MPI_COMM_WORLD, status=0x7fbfffee00) failed
> MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(608):
> MPIDU_Socki_handle_pollhup(439)...........: connection closed by peer (set=0,sock=15)
> rank 11 in job 1 udf2.stsci.edu_47530 caused collective abort of all ranks
> exit status of rank 11: killed by signal 9
> -------------------------------------------------------------------
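
The allocation messages in the log above already give a rough lower bound
on the memory each task needs. The sketch below simply sums them; it is an
estimate under stated assumptions, not a measurement. It assumes that the
18 tasks are spread evenly over the 9 dual-CPU hosts (2 tasks per host),
and it counts only the allocations that the log reports, so memory used by
the MPI library or by allocations that are not logged is not included.

```python
import re

# Allocation lines copied from the log above.
LOG_LINES = """\
Allocated 40 MByte communication buffer per processor.
Allocated 434.028 MByte for particle storage. 80
allocated 0.0762939 Mbyte for ngb search.
Allocated 321.943 MByte for BH-tree. 64
Allocated 17.6798 MByte for FFT data.
"""


def reported_mbytes(log_text):
    """Extract the MByte figure from every 'Allocated ... MByte' message."""
    pattern = re.compile(r"allocated\s+([0-9.]+)\s+mbyte", re.IGNORECASE)
    return [float(m.group(1)) for m in pattern.finditer(log_text)]


# Sum of the logged allocations for one task.
per_task = sum(reported_mbytes(LOG_LINES))

# Assumption: 18 tasks placed evenly on 9 dual-CPU hosts.
TASKS_PER_HOST = 2
per_host = per_task * TASKS_PER_HOST

print(f"reported per task: {per_task:.1f} MByte")
print(f"reported per host: {per_host:.1f} MByte")
```

With these numbers the logged allocations come to about 814 MByte per task,
or roughly 1.6 GByte per host under the 2-tasks-per-host assumption. If
that is close to the physical memory of a host, an out-of-memory kill
during the tree force becomes plausible.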