Re: Issue with Space Available

From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Date: Thu, 18 Mar 2021 16:40:00 +0100

Hi Dylan,

For some reason, your MPI library does not use the virtual filesystem /dev/shm as the backing store for the shared memory allocation, but rather an actual memory-mapped file placed in /tmp (which is in principle also possible, but very old-fashioned and not recommended).

The reason why you can only allocate ~7.6 GB is then likely that you have only that much space available in /tmp.
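
As a rough sanity check of the numbers in your log (a sketch only; that the request equals the number of MPI tasks times MaxMemSize is an assumption read off the log, not something confirmed here from the code), the arithmetic can be redone in the shell. All values are copied from your output:

```shell
#!/bin/bash
# Values copied from the log in the original message.
ntasks=45                 # "Running on 45 MPI tasks."
maxmem_mib=1300           # MaxMemSize from param.txt
requested=61341885832     # "Space Requested" in the Open MPI message
available=7678644224      # "Space Available" in the Open MPI message

# Assumption: the shared window is about ntasks * MaxMemSize (in MiB).
expected=$(( ntasks * maxmem_mib * 1024 * 1024 ))

echo "ntasks * MaxMemSize : $expected B"
echo "Space Requested     : $requested B"
echo "Space Available     : $available B"
echo "shortfall           : $(( requested - available )) B"
```

The product comes out within a few hundred kB of "Space Requested", which would also explain why lowering MaxMemSize only changes that number and not "Space Available".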

Issue the "df -k" command to get an overview of the space available in /tmp and /dev/shm.
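
A minimal sketch of that check, assuming bash and a POSIX "df" (the -P flag only forces one output line per filesystem). The helper name avail_bytes is made up for this example, and the requested size is copied from your error message:

```shell
#!/bin/bash
# Space the failed allocation asked for, copied from the error message.
requested=61341885832

# Print the free space of the filesystem holding "$1", in bytes
# (prints 0 if the directory does not exist).
avail_bytes() {
    local kb
    kb=$(df -Pk "$1" 2>/dev/null | awk 'NR==2 {print $4}')
    echo $(( ${kb:-0} * 1024 ))
}

for dir in /tmp /dev/shm; do
    free=$(avail_bytes "$dir")
    if [ "$free" -ge "$requested" ]; then
        echo "$dir: $free B free, enough for the request"
    else
        echo "$dir: $free B free, too small for the request"
    fi
done
```

If /tmp reports roughly the 7.6 GB from the error message while /dev/shm reports about 60 GB, that confirms the backing file is ending up in the wrong place.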

To fix the problem, you can in principle either increase space in /tmp, or use the "orte_tmpdir_base" parameter of OpenMPI to place the shared memory backing file into another directory where you have more quota...
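
For completeness, a sketch of the second variant. The directory name is purely hypothetical (pick any location with enough free space), and this assumes an Open MPI version that still uses the "orte_tmpdir_base" parameter:

```shell
#!/bin/bash
# Hypothetical directory on a filesystem with enough free space; adjust as needed.
backing_dir="${BACKING_DIR:-$PWD/ompi_tmp}"
mkdir -p "$backing_dir"

# Variant 1: pass the MCA parameter on the command line of the job script:
#   mpiexec --mca orte_tmpdir_base "$backing_dir" -np "$SLURM_NPROCS" ./Gadget4 param.txt

# Variant 2: Open MPI also reads MCA parameters from OMPI_MCA_<name>
# environment variables, so exporting this before mpiexec has the same effect.
export OMPI_MCA_orte_tmpdir_base="$backing_dir"

echo "Open MPI session directory base: $OMPI_MCA_orte_tmpdir_base"
```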

However, neither is really a good solution. Rather, if you have /dev/shm on your system, OpenMPI ought to use it for maximum performance and convenience. Normally, reasonably recent versions of OpenMPI detect this automatically and don't use a file in /tmp... It could be that you have a quite old version of OpenMPI, or one that wasn't compiled for your system.

Issue the "orte-info" command and check what version of OpenMPI you are using. If it is older than 4.0, download a current OpenMPI release, compile it, and use that instead.
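
If you prefer to script the check, here is a small sketch. It uses "mpiexec --version" rather than "orte-info" and assumes the mpiexec found in your PATH is the Open MPI launcher; the helper names extract_version and is_at_least_4 are made up for this example:

```shell
#!/bin/bash
# Extract the first "X.Y" or "X.Y.Z" version number from text on stdin.
extract_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Succeed if the version string given as $1 has a major version of 4 or more.
is_at_least_4() {
    local major=${1%%.*}
    [ -n "$major" ] && [ "$major" -ge 4 ]
}

if command -v mpiexec >/dev/null 2>&1; then
    version=$(mpiexec --version 2>&1 | extract_version)
    if [ -z "$version" ]; then
        echo "could not determine the MPI version"
    elif is_at_least_4 "$version"; then
        echo "found version $version: 4.0 or newer"
    else
        echo "found version $version: older than 4.0, consider building a newer release"
    fi
else
    echo "mpiexec not found in PATH"
fi
```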

Best,
Volker


> On 18. Mar 2021, at 15:25, dylan.chosson_at_edu.univ-fcomte.fr wrote:
>
> Dear all,
>
> I am a new user of GADGET-4.
> I have created an initial condition file using N-GenIC for 256^3 particles (dark matter only). But when I execute the GADGET code with
> "#!/bin/bash
> #SBATCH --time=24:00:00
> #SBATCH --nodes=1
> #SBATCH --ntasks-per-node=45
> #SBATCH --job-name=gadget-4_large_scale
> #SBATCH --output=gadget_output.txt
>
> echo
> echo "Running on hosts: $SLURM_NODELIST"
> echo "Running on $SLURM_NNODES nodes."
> echo "Running on $SLURM_NPROCS processors."
> echo "Current working directory is `pwd`"
> echo
>
> mpiexec -np $SLURM_NPROCS ./Gadget4 param.txt", I got the following error:
>
> "[compuphys-calc:01028] *** Process received signal ***
> [compuphys-calc:01028] Signal: Segmentation fault (11)
> [compuphys-calc:01028] Signal code: Address not mapped (1)
> [compuphys-calc:01028] Failing at address: (nil)
> --------------------------------------------------------------------------
> It appears as if there is not enough space for /tmp/openmpi-sessions-2983_at_compuphys-calc_0/57904/1/0/shared_window_5.compuphys-calc (the shared-memory backing
> file). It is likely that your MPI job will now either abort or experience
> performance degradation.
>
> Local host: compuphys-calc
> Space Requested: 61341885832 B
> Space Available: 7678644224 B
> --------------------------------------------------------------------------
> [compuphys-calc:01028] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7fa64a1b4980]
> [compuphys-calc:01028] [ 1] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_osc_sm.so(ompi_osc_sm_free+0x10c)[0x7fa62dc17abc]
> [compuphys-calc:01028] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi/mca_osc_sm.so(+0x2e6f)[0x7fa62dc17e6f]
> [compuphys-calc:01028] [ 3] /usr/lib/x86_64-linux-gnu/libmpi.so.20(ompi_win_allocate_shared+0x9b)[0x7fa64ad49b9b]
> [compuphys-calc:01028] [ 4] /usr/lib/x86_64-linux-gnu/libmpi.so.20(PMPI_Win_allocate_shared+0xd6)[0x7fa64ad7e826]
> [compuphys-calc:01028] [ 5] ./Gadget4(+0x31d3d)[0x5594c2b9ad3d]
> [compuphys-calc:01028] [ 6] ./Gadget4(+0x1e94a)[0x5594c2b8794a]
> [compuphys-calc:01028] [ 7] ./Gadget4(+0x1cd63)[0x5594c2b85d63]
> [compuphys-calc:01028] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fa649dd2bf7]
> [compuphys-calc:01028] [ 9] ./Gadget4(+0x1df9a)[0x5594c2b86f9a]
> [compuphys-calc:01028] *** End of error message ***
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 0 with PID 1028 on node compuphys-calc exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> [compuphys-calc:01012] 44 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
> [compuphys-calc:01012] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages"
>
>
> I use a university server dedicated to 2nd year Masters students with the following information:
>
>
> Running on hosts: compuphys-calc
> Running on 1 nodes.
> Running on 45 processors.
> Current working directory is /home/dchosso/Sem_10/Gadget-4/gadget4/my_sim/ics256
>
> --------------------------------------------------------------------------
> [[57904,1],34]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
>
> Module: OpenFabrics (openib)
> Host: compuphys-calc
>
> Another transport will be used instead, although this may result in
> lower performance.
>
> NOTE: You can disable this warning by setting the MCA parameter
> btl_base_warn_component_unused to 0.
> --------------------------------------------------------------------------
> Shared memory islands host a minimum of 45 and a maximum of 45 MPI ranks.
>
> ___ __ ____ ___ ____ ____ __
> / __) /__\ ( _ \ / __)( ___)(_ _)___ /. |
> ( (_-. /(__)\ )(_) )( (_-. )__) )( (___)(_ _)
> \___/(__)(__)(____/ \___/(____) (__) (_)
>
> This is Gadget, version 4.0.
> Git commit 01e6b1567c93fe1cfaffd499aa55151db2ed4208, Tue Mar 2 13:22:03 2021 +0100
>
> Code was compiled with the following compiler and flags:
> mpicxx -std=c++11 -ggdb -O3 -march=native -Wall -Wno-format-security -I/home/dchosso/Sem_10/hdf5-1.8.22/hdf5/include -I/home/dchosso/Sem_10/gsl-2.6/include -I/home/dchosso/Sem_10/fftw-3.3.9/include -Imy_sim/ics256/build -Isrc
>
>
> Code was compiled with the following settings:
> ASMTH=2.0
> DOUBLEPRECISION=2
> GADGET2_HEADER
> LEAN
> NSOFTCLASSES=1
> NTYPES=2
> PERIODIC
> PMGRID=256
> POSITIONS_IN_32BIT
> POWERSPEC_ON_OUTPUT
> RANDOMIZE_DOMAINCENTER
> SELFGRAVITY
> TREEPM_NOTIMESPLIT
>
> Running on 45 MPI tasks.
>
> BEGRUN: Size of particle structure 56 [bytes]
> BEGRUN: Size of sph particle structure 96 [bytes]
> BEGRUN: Size of gravity tree node 72 [bytes]
> BEGRUN: Size of neighbour tree node 112 [bytes]
> BEGRUN: Size of subfind auxiliary data 36 [bytes]
>
> -------------------------------------------------------------------------------------------------------------------------
> AvailMem: Largest = 61581.14 Mb (on task= 0), Smallest = 61581.14 Mb (on task= 0), Average = 61581.14 Mb
> Total Mem: Largest = 63898.89 Mb (on task= 0), Smallest = 63898.89 Mb (on task= 0), Average = 63898.89 Mb
> Committed_AS: Largest = 2317.74 Mb (on task= 0), Smallest = 2317.74 Mb (on task= 0), Average = 2317.74 Mb
> SwapTotal: Largest = 8192.00 Mb (on task= 0), Smallest = 8192.00 Mb (on task= 0), Average = 8192.00 Mb
> SwapFree: Largest = 8054.43 Mb (on task= 0), Smallest = 8054.43 Mb (on task= 0), Average = 8054.43 Mb
> AllocMem: Largest = 2317.74 Mb (on task= 0), Smallest = 2317.74 Mb (on task= 0), Average = 2317.74 Mb
> avail /dev/shm: Largest = 60703.95 Mb (on task= 0), Smallest = 60703.95 Mb (on task= 0), Average = 60703.95 Mb
> -------------------------------------------------------------------------------------------------------------------------
>
> Task=0 has the maximum commited memory and is host: compuphys-calc
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Obtaining parameters from file 'param.txt':
>
> InitCondFile /home/dchosso/Sem_10/Gadget-4/ICs/ics256/ics256
> OutputDir /home/dchosso/Sem_10/Gadget-4/gadget4/my_sim/ics256/output
> SnapshotFileBase snapshot
> OutputListFilename outputs_lcdm_gas.txt
> ICFormat 1
> SnapFormat 2
> TimeLimitCPU 86400
> CpuTimeBetRestartFile 7200
> MaxMemSize 1300
> TimeBegin 0.0909091
> TimeMax 1
> ComovingIntegrationOn 1
> Omega0 0.3
> OmegaLambda 0.7
> OmegaBaryon 0.04
> HubbleParam 0.7
> Hubble 0.1
> BoxSize 50000
> OutputListOn 0
> TimeBetSnapshot 1.06278
> TimeOfFirstSnapshot 0.95
> TimeBetStatistics 0.05
> NumFilesPerSnapshot 1
> MaxFilesWithConcurrentIO 1
> ErrTolIntAccuracy 0.01
> CourantFac 0.3
> MaxSizeTimestep 0.025
> MinSizeTimestep 0
> TypeOfOpeningCriterion 1
> ErrTolTheta 0.75
> ErrTolThetaMax 1
> ErrTolForceAcc 0.0025
> TopNodeFactor 2.5
> ActivePartFracForNewDomainDecomp 0.01
> ActivePartFracForPMinsteadOfEwald 0.05
> DesNumNgb 64
> MaxNumNgbDeviation 1
> UnitLength_in_cm 3.08568e+21
> UnitMass_in_g 1.989e+43
> UnitVelocity_in_cm_per_s 100000
> GravityConstantInternal 0
> SofteningComovingClass0 0.01
> SofteningMaxPhysClass0 0.01
> SofteningClassOfPartType0 0
>
> As you can see, there is 1 node and 45 cores available.
>
> Changing MaxMemSize to a lower value (e.g. 600) only changes "Space Requested".
> My question is: why does Gadget-4 only seem to see ~7.6 GB available when there is ~60 GB?
>
> I have already checked Tiago's post on the topic "Not enough memory" and contacted the system administrators about the "half memory of machine" issue; they have already run the command "mount -o remount,size=95% /dev/shm".
>
> I would be very grateful if someone could help me with this issue.
>
> With best regards,
> Dylan
>
Received on 2021-03-18 16:40:01
