Re: endrun(888) problem

From: <arlindo.trindade_at_gmail.com>
Date: Thu, 8 May 2014 13:46:14 +0100

Dear Volker Springel,

I've increased the number of particles to 2*256^3 and the box to 100
Mpc/h. I've also increased PartAllocFactor substantially (from as low as 1.6
to as high as 50!). The code now starts the domain decomposition, but after
a while I still get the same error at line 371 of forcetree.c.


For example, for the run whose output is shown below, I used the following
memory parameters:

PartAllocFactor 15
TreeAllocFactor 0.8
BufferSize 30


However, the error persists if I use

PartAllocFactor 20
TreeAllocFactor 1.2
BufferSize 30

or even higher values.


I also changed the values of both TOPNODEFACTOR and MAXTOPNODES, but I
still can't solve the problem.

Arlindo

*****************************************************************************************


This is Gadget, version `2.0'.

Running on 6 processors.

found 75 times in output-list.

Allocated 30 MByte communication buffer per processor.

Communication buffer has room for 714938 particles in gravity computation
Communication buffer has room for 245760 particles in density computation
Communication buffer has room for 196608 particles in hydro computation
Communication buffer has room for 182890 particles in domain decomposition


Hubble (internal units) = 100
G (internal units) = 43.0071
UnitMass_in_g = 1.989e+43
UnitTime_in_s = 3.08568e+19
UnitVelocity_in_cm_per_s = 100000
UnitDensity_in_cgs = 6.76991e-31
UnitEnergy_in_cgs = 1.989e+53

Task=0 FFT-Slabs=86
Task=1 FFT-Slabs=86
Task=2 FFT-Slabs=86
Task=3 FFT-Slabs=86
Task=4 FFT-Slabs=86
Task=5 FFT-Slabs=82

Allocated 6400 MByte for particle storage. 80

Allocated 3360 MByte for storage of SPH data. 84


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.10' on task=0
(contains 2097152 particles.)
distributing this file to tasks 0-0
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.11' on task=1
(contains 2097152 particles.)
distributing this file to tasks 1-1
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.12' on task=2
(contains 2097152 particles.)
distributing this file to tasks 2-2
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.13' on task=3
(contains 2097152 particles.)
distributing this file to tasks 3-3
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.14' on task=4
(contains 2097152 particles.)
distributing this file to tasks 4-4
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.15' on task=5
(contains 2097152 particles.)
distributing this file to tasks 5-5
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.4' on task=0
(contains 2097152 particles.)
distributing this file to tasks 0-0
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.5' on task=1
(contains 2097152 particles.)
distributing this file to tasks 1-1
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.6' on task=2
(contains 2097152 particles.)
distributing this file to tasks 2-2
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.7' on task=3
(contains 2097152 particles.)
distributing this file to tasks 3-3
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.8' on task=4
(contains 2097152 particles.)
distributing this file to tasks 4-4
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.9' on task=5
(contains 2097152 particles.)
distributing this file to tasks 5-5
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.0' on task=0
(contains 2097152 particles.)
distributing this file to tasks 0-0
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.1' on task=1
(contains 2097152 particles.)
distributing this file to tasks 1-2
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.2' on task=3
(contains 2097152 particles.)
distributing this file to tasks 3-3
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0


reading file `/mnt/lustre/atrindade/2LPT/IC08_mpich2/ics.3' on task=4
(contains 2097152 particles.)
distributing this file to tasks 4-5
Type 0 (gas): 1048576 (tot= 0016777216) masstab=0.0661731
Type 1 (halo): 1048576 (tot= 0016777216) masstab=0.430125
Type 2 (disk): 0 (tot= 0000000000) masstab=0
Type 3 (bulge): 0 (tot= 0000000000) masstab=0
Type 4 (stars): 0 (tot= 0000000000) masstab=0
Type 5 (bndry): 0 (tot= 0000000000) masstab=0

reading done.
Total number of particles : 0033554432

allocated 0.0762939 Mbyte for ngb search.

Allocated 4812.29 MByte for BH-tree. 64

domain decomposition...
NTopleaves= 127
work-load balance=6 memory-balance=6
exchange of 0027262976 particles
exchange of 0025434076 particles
exchange of 0023605176 particles
exchange of 0021776276 particles
exchange of 0019947376 particles
exchange of 0018118476 particles
exchange of 0016289576 particles
exchange of 0014460676 particles
exchange of 0012631776 particles
exchange of 0010802876 particles
exchange of 0008973976 particles
exchange of 0007145076 particles
exchange of 0005316176 particles
exchange of 0003487276 particles
exchange of 0001658376 particles
task 2: endrun called with an error level of 8882


task 1: endrun called with an error level of 8882


application called MPI_Abort(MPI_COMM_WORLD, 8882) - process 2
application called MPI_Abort(MPI_COMM_WORLD, 8882) - process 1
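
As a cross-check of the memory parameters, the "Allocated ... MByte" figures
in the log above can be reproduced with a few lines of Python. This is only a
sketch under stated assumptions: the rule MaxPart = PartAllocFactor *
(TotNumPart / NTask), the reading of the trailing 80 and 84 as bytes per
particle slot, and the rule MaxNodes = TreeAllocFactor * MaxPart come from a
reading of the Gadget-2 source, not from this thread. The function names are
made up for the illustration.

```python
MIB = 1024.0 * 1024.0  # the log's "MByte" is taken to mean bytes / 1024^2


def max_slots_per_task(total_particles, ntask, part_alloc_factor):
    """Particle slots reserved on each task.

    Assumed rule: PartAllocFactor * (total / NTask), with integer
    division first and truncation to an integer afterwards.
    """
    return int(part_alloc_factor * (total_particles // ntask))


def storage_mbyte(slots, bytes_per_slot):
    """Size of an array of `slots` structures in the log's MByte units."""
    return slots * bytes_per_slot / MIB


# Values copied from the log: 6 tasks, 33554432 particles in total,
# 16777216 of them gas, PartAllocFactor 15. The 80 and 84 are the trailing
# numbers on the two "Allocated ..." lines (assumed to be bytes per slot).
max_part = max_slots_per_task(33554432, 6, 15)
max_part_sph = max_slots_per_task(16777216, 6, 15)

print("MaxPart per task    :", max_part)
print("particle storage    : %g MByte" % storage_mbyte(max_part, 80))
print("SPH storage         : %g MByte" % storage_mbyte(max_part_sph, 84))

# Assumed rule: the tree may hold TreeAllocFactor * MaxPart nodes per task.
max_nodes = int(0.8 * max_part)
print("tree nodes per task :", max_nodes)
```

Under these assumptions the sketch gives the 6400 and 3360 MByte values that
the log reports, which means the particle arrays on each task are already
about 15 times larger than an even share of the 33554432 particles.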





2014-05-07 20:56 GMT+01:00 Volker Springel <volker_at_mpa-garching.mpg.de>:

>
>
> I think Ali's hint is spot on. The simulation is likely so small that the
> ratio between the number of tree nodes required by the code and the
> particle number is much larger than normally needed for well-loaded
> processors. It should be possible to cure this with a (drastic) increase of
> the parameter 'TreeAllocFactor' in the parameter file.
>
> Volker
>
> On May 7, 2014, at 5:46 PM, Ali Snedden wrote:
>
> > You might read the user manual and check that you are using reasonable
> > values in your parameter file (specifically All.PartAllocFactor). Perhaps
> > All.MaxPart is too small or you're not using a large enough value for
> > MaxNodes. Good luck.
> >
> >
> > ~ali
> >
> >
> > On Wed, May 7, 2014 at 11:31 AM, arlindo.trindade_at_gmail.com
> > <arlindo.trindade_at_gmail.com> wrote:
> > Hi Ali,
> >
> > Thanks for your email.
> >
> > I've changed the values that are passed to each endrun as you suggested
> > and the code breaks at line 371 in the forcetree.c file. But still I can't
> > find out what causes this error.
> >
> > Do you have any ideas?
> >
> > Best regards,
> > Arlindo
> >
> >
> >
> > 2014-05-07 16:03 GMT+01:00 Ali Snedden <asnedden_at_nd.edu>:
> >
> > Hello Arlindo,
> >
> > It would be nice to know which line it actually breaks at. You could
> > use a debugger like Gdb or just change the values passed to each line that
> > has endrun(888).
> >
> > It is worth your time to learn how to use a debugger. From my personal
> > experience, using a debugger has probably cut my development time by
> > 30-50% compared with just using print statements.
> >
> > To use Gdb for a parallel program, first compile Gadget with the '-g'
> > compiler option. Then enter the following on your command line.
> >
> > mpirun -np 2 xterm -geometry 100x62 -sb -sl 10000 -e gdb ./Gadget2
> >
> > I added a bunch of extra options to customize the x11 window. Then in
> > each of your two x11 windows enter the command line arguments (i.e. pass
> > your parameter file).
> >
> > start lcdm.param
> >
> > Then you can set breakpoints and do other neat things by following these
> > helpful websites.
> >
> > http://betterexplained.com/articles/debugging-with-gdb/
> > http://www.unknownroad.com/rtfm/gdbtut/gdbtoc.html
> >
> >
> > Best Regards
> > ~Ali
> >
> >
> > On Wed, May 7, 2014 at 10:42 AM, Arlindo Trindade _at_ gmail
> > <arlindo.trindade_at_gmail.com> wrote:
> >
> > Hi all,
> >
> > I'm trying to run some N-body simulation tests with Gadget 2 on a
> > cluster. The simulation is very small: the number of particles is
> > N=2*16^3 (dark matter + gas) and the box size is L=3.125 Mpc. However, I get
> > an endrun(888) error. The same thing happens if I increase the size of the
> > simulation (both L and N). I've identified the places where the function
> > endrun is called with the code 888 (timestep.c lines 482 and 530,
> > forcetree.c line 371), but I still can't understand what the problem is,
> > so I can't solve it.
> >
> > Does anyone have a suggestion?
> >
> > Cheers,
> > Arlindo
> >
> >
> >
> >
> > -----------------------------------------------------------
> > If you wish to unsubscribe from this mailing, send mail to
> > minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
> > A web-archive of this mailing list is available here:
> > http://www.mpa-garching.mpg.de/gadget/gadget-list
> >
> >
> >
> >
> >
> >
> >
> > --
> > Arlindo Trindade
> >
> >
> >
> >
> >
>
>
>
>
>



-- 
Arlindo Trindade
Received on 2014-05-08 14:46:35
