MPI_Sendrecv error in Gadget2

From: Michele Trenti <trenti_at_stsci.edu>
Date: Mon, 2 Oct 2006 16:03:43 -0400 (EDT)

Hi,

I am stuck on an MPI error in MPI_Sendrecv that appears when I try to run
"large" N simulations (400^3 particles on 18 CPUs) with Gadget2. The error
(reported in the log below) appears after the tree force evaluation in the
first step. Smaller runs (e.g. 256^3) complete smoothly in a few hours.

My understanding (see the error log below) is that if the tree force
computation takes too long (i.e. of the order of a few minutes), the
connection between the nodes is closed. However, the MPI ring stays up:
I can run mpdtrace and mpdringtest without trouble after I get the error
message from Gadget. Also, I have never been disconnected from an open
ssh session on a node, even after staying idle for days.

I am using a Linux cluster of 9 dual-CPU Sun Opteron hosts with
MPICH2-1.0.4.

Has anyone encountered a similar problem, or can anyone share some
insight into possible solutions?

Thanks a lot for your help,

Michele

-------------------------------------------------------------------
This is Gadget, version `2.0'.

Running on 18 processors.

found 14 times in output-list.

Allocated 40 MByte communication buffer per processor.

Communication buffer has room for 953250 particles in gravity computation
Communication buffer has room for 327680 particles in density computation
Communication buffer has room for 262144 particles in hydro computation
Communication buffer has room for 243854 particles in domain decomposition


Hubble (internal units) = 0.1
G (internal units) = 43007.1
UnitMass_in_g = 1.989e+43
UnitTime_in_s = 3.08568e+16
UnitVelocity_in_cm_per_s = 100000
UnitDensity_in_cgs = 6.76991e-22
UnitEnergy_in_cgs = 1.989e+53

Task=0 FFT-Slabs=15
Task=1 FFT-Slabs=15
Task=2 FFT-Slabs=15
Task=3 FFT-Slabs=15
Task=4 FFT-Slabs=15
Task=5 FFT-Slabs=15
Task=6 FFT-Slabs=15
Task=7 FFT-Slabs=15
Task=8 FFT-Slabs=15
Task=9 FFT-Slabs=15
Task=10 FFT-Slabs=15
Task=11 FFT-Slabs=15
Task=12 FFT-Slabs=15
Task=13 FFT-Slabs=15
Task=14 FFT-Slabs=15
Task=15 FFT-Slabs=15
Task=16 FFT-Slabs=15
Task=17 FFT-Slabs=1

Allocated 434.028 MByte for particle storage. 80


reading file `./ic/GadgetSnapshot_000' on task=0 (contains 64000000 particles.)
distributing this file to tasks 0-17
Type 0 (gas): 0 (tot= 0000000000) masstab=0
Type 1 (halo): 64000000 (tot= 0064000000) masstab=0.057824
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0

reading done.
Total number of particles : 0064000000

allocated 0.0762939 Mbyte for ngb search.

Allocated 321.943 MByte for BH-tree. 64

domain decomposition...
NTopleaves= 512
work-load balance=1.02083 memory-balance=1.02083
exchange of 0060408642 particles
exchange of 0026642403 particles
exchange of 0005557184 particles
exchange of 0000825654 particles
domain decomposition done.
begin Peano-Hilbert order...
Peano-Hilbert done.
Begin Ngb-tree construction.
Ngb-Tree contruction finished

Setting next time for snapshot file to Time_next= 0.0243902


Begin Step 0, Time: 0.0123457, Redshift: 80, Systemstep: 0, Dloga: 0
domain decomposition...
NTopleaves= 512
work-load balance=1.02083 memory-balance=1.02083
domain decomposition done.
begin Peano-Hilbert order...
Peano-Hilbert done.
Start force computation...
Starting periodic PM calculation.

Allocated 17.6798 MByte for FFT data.

done PM.
Tree construction.
Tree construction done.
Begin tree force.
tree is done.
Begin tree force.
[cli_16]: aborting job:
Fatal error in MPI_Sendrecv: Other MPI error, error stack:
MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x2a97d77f08, scount=488032, MPI_BYTE, dest=11, stag=18, rbuf=0x2a98d9b2e8, rcount=623888, MPI_BYTE, src=11, rtag=18, MPI_COMM_WORLD, status=0x7fbfffee00) failed
MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(608):
MPIDU_Socki_handle_pollhup(439)...........: connection closed by peer (set=0,sock=16)
[cli_11]: aborting job:
Fatal error in MPI_Sendrecv: Other MPI error, error stack:
MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x2a98513668, scount=623888, MPI_BYTE, dest=16, stag=18, rbuf=0x2a990016b8, rcount=488032, MPI_BYTE, src=16, rtag=18, MPI_COMM_WORLD, status=0x7fbfffee00) failed
MPIDI_CH3_Progress_wait(217)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(608):
MPIDU_Socki_handle_pollhup(439)...........: connection closed by peer (set=0,sock=15)
rank 11 in job 1 udf2.stsci.edu_47530 caused collective abort of all ranks
   exit status of rank 11: killed by signal 9
-------------------------------------------------------------------
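
As a rough cross-check on the log above, the per-task allocations it
reports can be added up. The following is only a sketch in Python: the
numbers are copied from the log, and the even split of 2 tasks per host
is an assumption based on 18 tasks running on 9 dual-CPU hosts. The
totals cover only what Gadget itself reports, not memory used by MPICH2
or the operating system, so they are a lower bound. The physical memory
per host is not stated in the message.

```python
# Per-task allocations in MByte, copied from the Gadget log above.
allocations_mb = {
    "communication buffer": 40.0,
    "particle storage": 434.028,
    "BH-tree": 321.943,
    "FFT data": 17.6798,
    "ngb search": 0.0762939,
}

# Assumption: the 18 tasks are spread evenly over the 9 dual-CPU hosts.
tasks = 18
hosts = 9
tasks_per_host = tasks // hosts

# Memory that Gadget reports allocating, per task and per host.
per_task_mb = sum(allocations_mb.values())
per_host_mb = per_task_mb * tasks_per_host

# Particle counts: 400^3 in total, shared among the tasks.
n_total = 400 ** 3
n_per_task = n_total / tasks

print(f"particles in total : {n_total}")
print(f"particles per task : {n_per_task:.0f}")
print(f"reported per task  : {per_task_mb:.1f} MByte")
print(f"reported per host  : {per_host_mb:.1f} MByte")
```

The particle total agrees with the 64000000 particles that the log
reports reading from the initial conditions file.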
Received on 2006-10-02 22:03:50
