Re: Segmentation fault from Michele Trenti on 2006-12-09 (GADGET General Discussion Mailing List)

From: Michele Trenti <trenti_at_stsci.edu>
Date: Sat, 9 Dec 2006 14:00:00 -0500 (EST)

Hi Volker,

thanks for your suggestion.

I double-checked the library installation, and it looks to me as though
the FFTW library was built in both single and double precision, with MPI
support (see below). The library was also compiled with the same
(Intel) compiler that I use for the Gadget2 sources.

I will now experiment with double precision and try switching to
gcc for both the libraries and the Gadget2 sources.

Btw, the same 512^3 simulation (ICs, parameter file, Makefile options) is
now running nicely (40 steps and no trouble) on a 64-bit system at NCSA
(an SGI Altix). I compiled there with the Intel compiler and used
single-precision FFTW.

The Xeon system on which I get the segmentation fault is 32-bit. Could
the source of the trouble be related to the sizes of
floats/integers?
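
As a rough check on the integer-size question, the sketch below computes
the largest flattened mesh index that the expression at pm_periodic.c:271
could legitimately reach for PMGRID=512, and tests whether that index and
the particle count from the log fit into a signed 32-bit integer. The
helper names are made up for illustration, and the padded z-dimension
dimz = 2*(PMGRID/2+1) is an assumption about the usual in-place
real-to-complex layout, not something taken from the Gadget2 source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: largest valid flattened index into a PMGRID^3 mesh
 * addressed as (x * dimy + y) * dimz + z.
 * Assumption: dimz is padded to 2*(PMGRID/2+1) for an in-place real FFT. */
int64_t max_flat_index(int pmgrid)
{
    assert(pmgrid > 0);
    int64_t dimy = pmgrid;
    int64_t dimz = 2 * (int64_t)(pmgrid / 2 + 1);
    int64_t x = pmgrid - 1, y = pmgrid - 1, z = dimz - 1;
    /* 64-bit arithmetic so the check itself cannot overflow */
    return (x * dimy + y) * dimz + z;
}

/* Returns 1 if value can be stored in a signed 32-bit integer. */
int fits_in_int32(int64_t value)
{
    return value >= INT32_MIN && value <= INT32_MAX;
}
```

For PMGRID=512, max_flat_index returns 134742015, and both that value and
the 134217728 particles reported in the log pass fits_in_int32. Under
these assumptions a plain 32-bit overflow of the index looks unlikely,
although this does not rule out an out-of-range slab index for an
individual particle.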

Thanks again,

Michele

----------------------
FFTW libraries included on the Xeon (32 bit) cluster
----------------------
[trenti_at_tunc ~]$ cd $FFTW_HOME/
[trenti_at_tunc intel-cmpi]$ cd include/
[trenti_at_tunc include]$ ls
dfftw.h drfftw_mpi.h rfftw.h srfftw.h
dfftw_mpi.h drfftw_threads.h sfftw.h srfftw_mpi.h
dfftw_threads.h fftw_f77.i sfftw_mpi.h srfftw_threads.h
drfftw.h fftw.h sfftw_threads.h
[trenti_at_tunc intel-cmpi]$ cd ../lib/
[trenti_at_tunc lib]$ ls
libdfftw.a libdrfftw_mpi.la libsfftw_threads.a
libdfftw.la libdrfftw_threads.a libsfftw_threads.la
libdfftw_mpi.a libdrfftw_threads.la libsrfftw.a
libdfftw_mpi.la libfftw.a libsrfftw.la
libdfftw_threads.a librfftw.a libsrfftw_mpi.a
libdfftw_threads.la libsfftw.a libsrfftw_mpi.la
libdrfftw.a libsfftw.la libsrfftw_threads.a
libdrfftw.la libsfftw_mpi.a libsrfftw_threads.la
libdrfftw_mpi.a libsfftw_mpi.la
[trenti_at_tunc lib]$
---------------------
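
The header check against the listing above can be sketched as follows.
The mapping from compile-time switches to header names is an assumption
about the FFTW type-prefix convention (an "s" or "d" prefix unless
NOTYPEPREFIX_FFTW is set) and should be verified against the Gadget2
source; the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Header names copied from the include/ listing above. */
static const char *const installed_headers[] = {
    "dfftw.h", "drfftw_mpi.h", "rfftw.h", "srfftw.h",
    "dfftw_mpi.h", "drfftw_threads.h", "sfftw.h", "srfftw_mpi.h",
    "dfftw_threads.h", "fftw_f77.i", "sfftw_mpi.h", "srfftw_threads.h",
    "drfftw.h", "fftw.h", "sfftw_threads.h",
};

/* Hypothetical helper: name of the real-transform MPI header a build with
 * the given switches would be expected to include (assumed convention). */
const char *expected_rfftw_mpi_header(int notypeprefix, int doubleprecision)
{
    if (notypeprefix)
        return "rfftw_mpi.h";
    return doubleprecision ? "drfftw_mpi.h" : "srfftw_mpi.h";
}

/* Returns 1 if the named header appears in the listing, 0 otherwise. */
int header_is_installed(const char *name)
{
    size_t n = sizeof installed_headers / sizeof installed_headers[0];
    assert(name != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(installed_headers[i], name) == 0)
            return 1;
    return 0;
}
```

With the Makefile as posted (neither DOUBLEPRECISION_FFTW nor
NOTYPEPREFIX_FFTW set), the sketch expects srfftw_mpi.h, which is present
in the listing; the unprefixed rfftw_mpi.h is not.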

Michele Trenti
Space Telescope Science Institute
3700 San Martin Drive Phone: +1 410 338 4987
Baltimore MD 21218 U.S. Fax: +1 410 338 4767

" We shall not cease from exploration
   And the end of all our exploring
   Will be to arrive where we started
   And know the place for the first time. "

                                      T. S. Eliot

On Sat, 9 Dec 2006, Volker Springel wrote:

>
>
> Hi Michele,
>
> It could be that your FFTW library was compiled with default settings
> (which makes it double-precision), while with the gadget-Makefile you
> used, you will call it as single-precision library. If that's the
> problem, then setting DOUBLEPRECISION_FFTW should fix it.
>
> Volker
>
> On Saturday 09 December 2006 01:06, Michele Trenti wrote:
>> Hello,
>>
>> I have also just experienced segmentation faults on Gadget2 during PM
>> calculation (backtraced at pm_periodic.c:271), while running on the
>> Xeon cluster at NCSA (see debugging information below). The
>> segmentation fault happens only for "large" runs, e.g. 512^3, while
>> with small N, like 64^3 all works nicely.
>>
>> I was wondering if someone has experience using Gadget2 on the same
>> system (or on another Teragrid system; I just started exploring
>> NCSA clusters, but my allocation is Teragrid-wide) and is willing to
>> share his/her expertise on the compilation. Maybe my Makefile
>> (reported at the end) is not completely correct?
>>
>> Output during the run:
>> ---------------------------------------------------------
>> [trenti_at_tund ~/gadget_test]$ more gad_512_512.843502.o
>>
>> This is Gadget, version `2.0'.
>>
>> Running on 16 processors.
>>
>> found 15 times in output-list.
>>
>> Allocated 100 MByte communication buffer per processor.
>>
>> Communication buffer has room for 2383126 particles in gravity computation
>> Communication buffer has room for 819200 particles in density computation
>> Communication buffer has room for 655360 particles in hydro computation
>> Communication buffer has room for 609636 particles in domain decomposition
>>
>>
>> Hubble (internal units) = 0.1
>> G (internal units) = 43007.1
>> UnitMass_in_g = 1.989e+43
>> UnitTime_in_s = 3.08568e+16
>> UnitVelocity_in_cm_per_s = 100000
>> UnitDensity_in_cgs = 6.76991e-22
>> UnitEnergy_in_cgs = 1.989e+53
>>
>> Task=0 FFT-Slabs=32
>> Task=1 FFT-Slabs=32
>> Task=2 FFT-Slabs=32
>> Task=3 FFT-Slabs=32
>> Task=4 FFT-Slabs=32
>> Task=5 FFT-Slabs=32
>> Task=6 FFT-Slabs=32
>> Task=7 FFT-Slabs=32
>> Task=8 FFT-Slabs=32
>> Task=9 FFT-Slabs=32
>> Task=10 FFT-Slabs=32
>> Task=11 FFT-Slabs=32
>> Task=12 FFT-Slabs=32
>> Task=13 FFT-Slabs=32
>> Task=14 FFT-Slabs=32
>> Task=15 FFT-Slabs=32
>>
>> Allocated 896 MByte for particle storage. 80
>>
>>
>> reading file `./ic512_512_gic' on task=0 (contains 134217728 particles.)
>> distributing this file to tasks 0-15
>> Type 0 (gas): 0 (tot= 0000000000) masstab=0
>> Type 1 (halo): 134217728 (tot= 0134217728) masstab=7.2163
>> Type 2 (disk): 0 (tot= 0000000000) masstab=0
>> Type 3 (bulge): 0 (tot= 0000000000) masstab=0
>> Type 4 (stars): 0 (tot= 0000000000) masstab=0
>> Type 5 (bndry): 0 (tot= 0000000000) masstab=0
>>
>> reading done.
>> Total number of particles : 0134217728
>>
>> allocated 0.0762939 Mbyte for ngb search.
>>
>> Allocated 627.963 MByte for BH-tree. 64
>>
>> domain decomposition...
>> NTopleaves= 512
>> work-load balance=1.00646 memory-balance=1.00646
>> exchange of 0117510361 particles
>> exchange of 0057421241 particles
>> exchange of 0012167098 particles
>> exchange of 0003632194 particles
>> domain decomposition done.
>> begin Peano-Hilbert order...
>> Peano-Hilbert done.
>> Begin Ngb-tree construction.
>> Ngb-Tree contruction finished
>>
>> Setting next time for snapshot file to Time_next= 0.0322581
>>
>>
>> Begin Step 0, Time: 0.02, Redshift: 49, Systemstep: 0, Dloga: 0
>> domain decomposition...
>> NTopleaves= 512
>> work-load balance=1.00646 memory-balance=1.00646
>> domain decomposition done.
>> begin Peano-Hilbert order...
>> Peano-Hilbert done.
>> Start force computation...
>> Starting periodic PM calculation.
>>
>> Allocated 102.556 MByte for FFT data.
>>
>> done PM.
>> Tree construction.
>> Tree construction done.
>> Begin tree force.
>> tree is done.
>> Begin tree force.
>> tree is done.
>> force computation done.
>> type=1 dmean=1000 asmth=1250 minmass=7.2163 a=0.02
>> sqrt(<p^2>)=1.80051 dlogmax=0.801017
>> displacement time constraint: 0.025 (0.025)
>>
>> Begin Step 1, Time: 0.0202531, Redshift: 48.3752, Systemstep: 0.000253062, Dloga: 0.0125737
>> domain decomposition...
>> NTopleaves= 512
>> work-load balance=1.02818 memory-balance=1.03561
>> exchange of 0001322377 particles
>> domain decomposition done.
>> begin Peano-Hilbert order...
>> Peano-Hilbert done.
>> Start force computation...
>> Starting periodic PM calculation.
>> Segmentation fault (core dumped)
>> User defined signal 2
>> [trenti_at_tund ~/gadget_test]$
>> -----------------------------------------------------
>>
>>
>> And this is the gdb analysis of the core file:
>> -------------------------------------------------------
>> [trenti_at_tund debug]$ gdb ./Gadget2DEBUG core.14409
>> GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)
>> Copyright 2003 Free Software Foundation, Inc.
>> GDB is free software, covered by the GNU General Public License, and you are
>> welcome to change it and/or distribute copies of it under certain conditions.
>> Type "show copying" to see the conditions.
>> There is absolutely no warranty for GDB. Type "show warranty" for details.
>> This GDB was configured as "i386-redhat-linux-gnu"...
>> Core was generated by `Gadget2DEBUG cluster.param'.
>> Program terminated with signal 11, Segmentation fault.
>> Reading symbols from /usr/apps/math/gsl/gsl-1.6/intel90/lib/libgsl.so.0...done.
>> Loaded symbols for /usr/apps/math/gsl/gsl-1.6/intel90/lib/libgsl.so.0
>> Reading symbols from /usr/apps/math/gsl/gsl-1.6/intel90/lib/libgslcblas.so.0...done.
>> Loaded symbols for /usr/apps/math/gsl/gsl-1.6/intel90/lib/libgslcblas.so.0
>> Reading symbols from /usr/local/intel/9.0.026/lib/libimf.so...done.
>> Loaded symbols for /usr/local/intel/9.0.026/lib/libimf.so
>> Reading symbols from /lib/i686/libm.so.6...done.
>> Loaded symbols for /lib/i686/libm.so.6
>> Reading symbols from /usr/local/cmpipro-2.1.0-1tgm2/lib/libcmpi.so...done.
>> Loaded symbols for /usr/local/cmpipro-2.1.0-1tgm2/lib/libcmpi.so
>> Reading symbols from /lib/i686/libpthread.so.0...done.
>> Loaded symbols for /lib/i686/libpthread.so.0
>> Reading symbols from /opt/gm/lib/libgm.so.0...done.
>> Loaded symbols for /opt/gm/lib/libgm.so.0
>> Reading symbols from /lib/libgcc_s.so.1...done.
>> Loaded symbols for /lib/libgcc_s.so.1
>> Reading symbols from /lib/i686/libc.so.6...done.
>> Loaded symbols for /lib/i686/libc.so.6
>> Reading symbols from /lib/libdl.so.2...done.
>> Loaded symbols for /lib/libdl.so.2
>> Reading symbols from /lib/ld-linux.so.2...done.
>> Loaded symbols for /lib/ld-linux.so.2
>> Reading symbols from /lib/libnss_files.so.2...done.
>> Loaded symbols for /lib/libnss_files.so.2
>> #0 0x08062562 in pmforce_periodic () at pm_periodic.c:271
>> 271 workspace[(slab_x * dimy + slab_y) * dimz + slab_z] += P[i].Mass * (1.0 - dx) * (1.0 - dy) * (1.0 - dz);
>> (gdb) backtrace
>> #0 0x08062562 in pmforce_periodic () at pm_periodic.c:271
>> #1 0xbfffcef0 in ?? ()
>> Cannot access memory at address 0x2
>> (gdb)
>>
>> -----------------------------------------------
>>
>> And finally this is what I use as Makefile:
>> --------------------------------------------
>> [trenti_at_tund source]$ more Makefile
>>
>> #----------------------------------------------------------------------
>> # From the list below, please activate/deactivate the options that
>> # apply to your run. If you modify any of these options, make sure
>> # that you recompile the whole code by typing "make clean; make".
>> #
>> # Look at end of file for a brief guide to the compile-time options.
>> #----------------------------------------------------------------------
>>
>>
>> #--------------------------------------- Basic operation mode of code
>> OPT += -DPERIODIC
>> #OPT += -DUNEQUALSOFTENINGS
>>
>>
>> #--------------------------------------- Things that are always recommended
>> OPT += -DPEANOHILBERT
>> OPT += -DWALLCLOCK
>>
>>
>> #--------------------------------------- TreePM Options
>> OPT += -DPMGRID=512
>> #OPT += -DPLACEHIGHRESREGION=3
>> #OPT += -DENLARGEREGION=1.2
>> #OPT += -DASMTH=1.25
>> #OPT += -DRCUT=4.5
>>
>>
>> #--------------------------------------- Single/Double Precision
>> #OPT += -DDOUBLEPRECISION
>> #OPT += -DDOUBLEPRECISION_FFTW
>>
>>
>> #--------------------------------------- Time integration options
>> OPT += -DSYNCHRONIZATION
>> #OPT += -DFLEXSTEPS
>> #OPT += -DPSEUDOSYMMETRIC
>> #OPT += -DNOSTOP_WHEN_BELOW_MINTIMESTEP
>> #OPT += -DNOPMSTEPADJUSTMENT
>>
>>
>> #--------------------------------------- Output options
>> #OPT += -DHAVE_HDF5
>> #OPT += -DOUTPUTPOTENTIAL
>> #OPT += -DOUTPUTACCELERATION
>> #OPT += -DOUTPUTCHANGEOFENTROPY
>> #OPT += -DOUTPUTTIMESTEP
>>
>>
>> #--------------------------------------- Things for special behaviour
>> #OPT += -DNOGRAVITY
>> #OPT += -DNOTREERND
>> #OPT += -DNOTYPEPREFIX_FFTW
>> #OPT += -DLONG_X=60
>> #OPT += -DLONG_Y=5
>> #OPT += -DLONG_Z=0.2
>> #OPT += -DTWODIMS
>> #OPT += -DSPH_BND_PARTICLES
>> #OPT += -DNOVISCOSITYLIMITER
>> #OPT += -DCOMPUTE_POTENTIAL_ENERGY
>> #OPT += -DLONGIDS
>> #OPT += -DISOTHERMAL
>> #OPT += -DSELECTIVE_NO_GRAVITY=2+4+8+16
>>
>> #--------------------------------------- Testing and Debugging options
>> #OPT += -DFORCETEST=0.1
>>
>>
>> #--------------------------------------- Glass making
>> #OPT += -DMAKEGLASS=262144
>>
>>
>> #----------------------------------------------------------------------
>> # Here, select compile environment for the target machine. This may need
>> # adjustment, depending on your local system. Follow the examples to add
>> # additional target platforms, and to get things properly compiled.
>> #----------------------------------------------------------------------
>>
>> #--------------------------------------- Select some defaults
>>
>> CC = cmpicc # sets the C-compiler
>> OPTIMIZE = -O3 -Wall # sets optimization and warning flags
>> MPICHLIB = -lmpich
>>
>>
>> #--------------------------------------- Select target computer
>>
>> #SYSTYPE="UDF"
>> SYSTYPE="XEON"
>> #SYSTYPE="Regatta"
>> #SYSTYPE="RZG_LinuxCluster"
>> #SYSTYPE="RZG_LinuxCluster-gcc"
>> #SYSTYPE="Opteron"
>>
>> #--------------------------------------- Adjust settings for target computer
>>
>>
>> ifeq ($(SYSTYPE),"XEON")
>> CC = cmpicc
>> OPTIMIZE = -O3 -Wall -g
>> GSL_INCL = -I/${GSL_HOME}/include
>> GSL_LIBS = -L/${GSL_HOME}/lib
>> FFTW_INCL= -I/${FFTW_HOME}/include
>> FFTW_LIBS= -L/${FFTW_HOME}/lib
>> MPICHLIB =
>> HDF5INCL =
>> HDF5LIB =
>> endif
>>
>> ...
>>
>> ---------------------------------------
>>
>>
>> Thanks a lot for your help,
>>
>> Michele
>>
>> Michele Trenti
>> Space Telescope Science Institute
>> 3700 San Martin Drive Phone: +1 410 338 4987
>> Baltimore MD 21218 U.S. Fax: +1 410 338 4767
>>
>>
>> " We shall not cease from exploration
>> And the end of all our exploring
>> Will be to arrive where we started
>> And know the place for the first time. "
>>
>> T. S. Eliot
>>
>> On Sat, 2 Dec 2006, Volker Springel wrote:
>>> On Wednesday 29 November 2006 22:44, Craig Rudick wrote:
>>>> Hi,
>>>>
>>>> We have been attempting to switch from Gadget1 to Gadget2, but
>>>> have been running into the problem that Gadget2 produces a
>>>> segmentation fault and dies when we try to run the example initial
>>>> conditions. The segmentation fault almost always occurs during
>>>> or immediately following the first domain decomposition, with
>>>> output that reads:
>>>>
>>>> domain decomposition...
>>>> Segmentation fault
>>>>
>>>> We have tried compiling using both the Portland Group and Intel
>>>> compilers and see the same behavior.
>>>>
>>>> The really frustrating part is that whether we get this error
>>>> depends both on the initial conditions used and on the number
>>>> of processors on which it is run. That is, the 'cluster' example
>>>> IC runs perfectly on up to 16 nodes, whereas all of the other
>>>> example ICs will run only on four or fewer nodes (2 processors per
>>>> node).
>>>>
>>>> Has anyone seen similar errors with Gadget2 or have any ideas on
>>>> what might be the solution to this error?
>>>
>>> Hi Craig,
>>>
>>> This is strange. I can't reproduce this problem on any of the
>>> machines I have access to (which are a few), and I also haven't
>>> heard from anyone else experiencing this error. It could be related
>>> to the set-up of your cluster and/or the compiler/MPI library you
>>> are using. I'd suggest compiling with the gcc-compiler (with -g)
>>> and look at the core file that's produced by the crash with a
>>> debugger. If the crash is reproducible, this would tell you if it
>>> is caused by gadget2, and where this happens.
>>>
>>> Volker
>>>
>>>> Thanks,
>>>> Craig Rudick
>>>> Case Western Reserve University
>>>>
>>>>
>>>>
>>>>
>>>> -----------------------------------------------------------
>>>>
>>>> If you wish to unsubscribe from this mailing, send mail to
>>>> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe
>>>> gadget-list A web-archive of this mailing list is available here:
>>>> http://www.mpa-garching.mpg.de/gadget/gadget-list
>>>
>>
>
>
>
>
>
Received on 2006-12-09 20:05:10