Re: On Gadget-4 Foreign Nodes

From: Tiago Castro <tiagobscastro_at_gmail.com>
Date: Sat, 19 Dec 2020 18:40:30 +0100

Many thanks for your careful inspection of the log of my run, Volker!

Best,
Tiago Castro, Post Doc, Department of Physics / UNITS / OATS
Phone: +39 040 3199 120
Mobile: +39 388 794 1562
Email: tiagobscastro_at_gmail.com
Website: tiagobscastro.com
Skype: tiagobscastro
Address: Osservatorio Astronomico di Trieste / Villa Bazzoni
Via Bazzoni, 2, 34143 Trieste TS


On Sat, 19 Dec 2020 at 17:49, Volker Springel <
vspringel_at_mpa-garching.mpg.de> wrote:

>
> Hi Tiago,
>
> The code has run into a memory problem while building the locally
> essential tree, because you need more space for the imported tree nodes
> (these are quite bulky, since you use 5th-order multipoles and double
> precision throughout).
>
> The best solution would be to increase MaxMemSize in your parameter file.
> Because you are currently using only slightly less than half of the memory
> available on your compute nodes, there is in principle plenty of room for
> this.
>
> However, from your output for "avail /dev/shm:" I can see that you are
> suffering from a familiar misconfiguration of your compute nodes: your
> shared memory is unnecessarily restricted to 50% of the physical memory.
>
> This 50% restriction is an unfortunate default setting for the maximum
> shared memory adopted in many Linux distributions. With this, GADGET4 can
> only use half of the available memory...
>
> But there is really no deeper reason for this limit, and one can easily
> change the maximum size of /dev/shm to, for example, 95% of the physical
> memory.
>
> Look, for example, at this paper describing the issue,
> https://dl.acm.org/doi/10.1145/3176364.3176367, particularly the end of
> section 3.2, where a plea is made to system administrators to correct this
> setting. In fact, the stability of the compute nodes is completely
> unaffected if the limit is raised to, e.g., 95% of the available memory.
> This can be done with a simple
>
> mount -o remount,size=95% /dev/shm
>
> command on the fly, but this is of course only possible for
> administrators. And in any case, the setting needs to be made permanent to
> stay in place after the next reboot.
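>
> (As an illustration only, and assuming the usual tmpfs mount for /dev/shm,
> the permanent equivalent would be an /etc/fstab entry along the lines of
>
> tmpfs  /dev/shm  tmpfs  defaults,size=95%  0  0
>
> although the exact syntax may differ between distributions.)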
>
> I note that this change has been implemented on all the supercomputers in
> Garching (both at LRZ and at MPCDF), causing no problems at all. I would
> therefore recommend that you ask your system administrator to do the same
> on your machine, too.
>
> Then you can increase MaxMemSize and the problem should be solvable that
> way.
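>
> (Purely as an illustration, assuming roughly 250 GB of physical memory and
> 37 MPI ranks per node: once the /dev/shm limit is lifted, the corresponding
> line in your parameter file could be raised from its current value to
> something like
>
> MaxMemSize    6000
>
> i.e. about 6 GB per MPI rank; the exact value is of course your choice and
> depends on what the nodes can actually accommodate.)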
>
> Otherwise, you can in principle try to increase the 0.33 number in the
> following line
>
> int nspace = (0.33 * Mem.FreeBytes) / (sizeof(gravnode) + 8 *
> sizeof(foreign_gravpoint_data));
>
> in fmm.cc, for example to 0.5, or even a bit larger. This might work in
> your particular case, but only if you're lucky.
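>
> (For instance, with the factor raised to 0.5 the modified line would read
>
> int nspace = (0.5 * Mem.FreeBytes) / (sizeof(gravnode) + 8 * sizeof(foreign_gravpoint_data));
>
> which simply reserves a larger fraction of the currently free memory for
> the imported tree nodes and foreign particles.)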
>
> Best,
> Volker
>
>
>
> > On 16. Dec 2020, at 18:46, Tiago Castro <tiagobscastro_at_gmail.com> wrote:
> >
> > Dear list,
> >
> > I am trying to run a cosmological simulation (500.0 Mpc/h, 1024^3
> > particles), and the code aborts with the error below. Looking at the
> > source code, I could not understand exactly how MaxForeignNodes is
> > determined, or whether there is something I can change in the parameter
> > file. I am already using the entire local cluster.
> >
> > Many thanks!
> > ---------------------------------
> > Shared memory islands host a minimum of 37 and a maximum of 37 MPI
> ranks.
> > We shall use 6 MPI ranks in total for assisting one-sided communication
> (1 per shared memory node).
> >
> > ___ __ ____ ___ ____ ____ __
> > / __) /__\ ( _ \ / __)( ___)(_ _)___ /. |
> > ( (_-. /(__)\ )(_) )( (_-. )__) )( (___)(_ _)
> > \___/(__)(__)(____/ \___/(____) (__) (_)
> >
> > This is Gadget, version 4.0.
> > Git commit unknown, unknown
> >
> > Code was compiled with the following compiler and flags:
> > mpicxx -std=c++11 -ggdb -O3 -march=native -Wall -Wno-format-security
> > -I/beegfs/tcastro/gadget4/include/ -I/beegfs/tcastro/gadget4/include/gsl
> > -I/beegfs/tcastro/gadget4/include/ -Ibuild -Isrc
> >
> >
> > Code was compiled with the following settings:
> > ASMTH=3.0
> > CREATE_GRID
> > DOUBLEPRECISION=1
> > DOUBLEPRECISION_FFTW
> > ENLARGE_DYNAMIC_RANGE_IN_TIME
> > FMM
> > FOF
> > FOF_GROUP_MIN_LEN=100
> > FOF_LINKLENGTH=0.2
> > FOF_PRIMARY_LINK_TYPES=2
> > HIERARCHICAL_GRAVITY
> > IMPOSE_PINNING
> > LEAN
> > MERGERTREE
> > MULTIPOLE_ORDER=5
> > NGENIC=1024
> > NGENIC_2LPT
> > NSOFTCLASSES=1
> > NTAB=256
> > NTYPES=6
> > OUTPUT_TIMESTEP
> > PERIODIC
> > PMGRID=1024
> > POWERSPEC_ON_OUTPUT
> > PRESERVE_SHMEM_BINARY_INVARIANCE
> > RANDOMIZE_DOMAINCENTER
> > RCUT=6.0
> > SELFGRAVITY
> > SUBFIND
> > SUBFIND_HBT
> > TREE_NUM_BEFORE_NODESPLIT=4
> >
> >
> > Running on 216 MPI tasks.
> >
> >
> > BEGRUN: Size of particle structure 128 [bytes]
> > BEGRUN: Size of sph particle structure 216 [bytes]
> > BEGRUN: Size of gravity tree node 352 [bytes]
> > BEGRUN: Size of neighbour tree node 192 [bytes]
> > BEGRUN: Size of subfind auxiliary data 64 [bytes]
> >
> > PINNING: We have 4 sockets, 40 physical cores and 40 logical cores on
> the first MPI-task's node.
> > PINNING: Looks like 10 logical cores are available.
> > PINNING: Looks like already before start of the code, a tight binding
> was imposed.
> > PINNING: We refrain from any pinning attempt ourselves. (This can be
> > changed by setting the compile flag IMPOSE_PINNING_OVERRIDE_MODE.)
> >
> >
> -------------------------------------------------------------------------------------------------------------------------
>
> > AvailMem:       Largest = 251624.55 Mb (on task=144), Smallest = 251293.10 Mb (on task=72),  Average = 251463.74 Mb
> > Total Mem:      Largest = 257655.01 Mb (on task=0),   Smallest = 257655.01 Mb (on task=0),   Average = 257655.01 Mb
> > Committed_AS:   Largest =   6361.91 Mb (on task=72),  Smallest =   6030.45 Mb (on task=144), Average =   6191.26 Mb
> > SwapTotal:      Largest =   4000.00 Mb (on task=0),   Smallest =   4000.00 Mb (on task=0),   Average =   4000.00 Mb
> > SwapFree:       Largest =   4000.00 Mb (on task=0),   Smallest =   3966.40 Mb (on task=180), Average =   3992.73 Mb
> > AllocMem:       Largest =   6361.91 Mb (on task=72),  Smallest =   6030.45 Mb (on task=144), Average =   6191.26 Mb
> > avail /dev/shm: Largest = 128788.88 Mb (on task=144), Smallest = 128785.64 Mb (on task=0),   Average = 128787.51 Mb
> >
> -------------------------------------------------------------------------------------------------------------------------
>
> > Task=0 has the maximum commited memory and is host: gen09-10
> >
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> > Obtaining parameters from file 'param.1024p3.txt':
> >
> > InitCondFile ./ics
> > OutputDir ./1024p3
> > SnapshotFileBase snap
> > OutputListFilename ./outputs.txt
> > ICFormat 2
> > SnapFormat 3
> > TimeLimitCPU 172800
> > CpuTimeBetRestartFile 7200
> > MaxMemSize 3200
> > TimeBegin 0.01
> > TimeMax 1
> > ComovingIntegrationOn 1
> > Omega0 0.30711
> > OmegaLambda 0.69289
> > OmegaBaryon 0.04825
> > HubbleParam 0.6777
> > Hubble 0.1
> > BoxSize 500000
> > OutputListOn 1
> > TimeBetSnapshot 0
> > TimeOfFirstSnapshot 0
> > TimeBetStatistics 0.01
> > NumFilesPerSnapshot 16
> > MaxFilesWithConcurrentIO 8
> > ErrTolIntAccuracy 0.05
> > CourantFac 0.15
> > MaxSizeTimestep 0.05
> > MinSizeTimestep 0
> > TypeOfOpeningCriterion 1
> > ErrTolTheta 0.4
> > ErrTolThetaMax 1
> > ErrTolForceAcc 0.005
> > TopNodeFactor 3
> > ActivePartFracForNewDomainDecomp 0.01
> > DesNumNgb 64
> > MaxNumNgbDeviation 1
> > UnitLength_in_cm 3.08568e+21
> > UnitMass_in_g 1.989e+43
> > UnitVelocity_in_cm_per_s 100000
> > GravityConstantInternal 0
> > SofteningComovingClass0 12
> > SofteningMaxPhysClass0 12
> > SofteningClassOfPartType0 0
> > SofteningClassOfPartType1 0
> > SofteningClassOfPartType2 0
> > SofteningClassOfPartType3 0
> > SofteningClassOfPartType4 0
> > SofteningClassOfPartType5 0
> > DesLinkNgb 20
> > ArtBulkViscConst 1
> > MinEgySpec 0
> > InitGasTemp 0
> > NSample 1024
> > GridSize 1024
> > Seed 181170
> > SphereMode 1
> > PowerSpectrumType 2
> > ReNormalizeInputSpectrum 1
> > PrimordialIndex 1
> > ShapeGamma 0.21
> > Sigma8 0.8288
> > PowerSpectrumFile powerspec
> > InputSpectrum_UnitLength_in_cm 3.08568e+24
> >
> > MALLOC: Allocation of shared memory took 0.00582997 sec
> >
> > found 5 times in output-list.
> > BEGRUN: Hubble (internal units) = 0.1
> > BEGRUN: h = 0.6777
> > BEGRUN: G (internal units) = 43018.7
> > BEGRUN: UnitMass_in_g = 1.989e+43
> > BEGRUN: UnitLenth_in_cm = 3.08568e+21
> > BEGRUN: UnitTime_in_s = 3.08568e+16
> > BEGRUN: UnitVelocity_in_cm_per_s = 100000
> > BEGRUN: UnitDensity_in_cgs = 6.76991e-22
> > BEGRUN: UnitEnergy_in_cgs = 1.989e+53
> >
> > NGENIC: generated grid of size 1024
> > NGENIC: computing displacement fields...
> > NGENIC: vel_prefac1= 5.54175 hubble_a=55.4176 fom1=0.999999
> > NGENIC: vel_prefac2= 11.0835 hubble_a=55.4176 fom2=2
> > found 579000 rows in input spectrum table
> >
> > Normalization of spectrum in file: Sigma8 = 0.819434
> > Normalization adjusted to Sigma8=0.8288 (Normfac=1.02299)
> >
> > NGENIC: Dplus=78.3218
> > NGENIC_2LPT: Computing secondary source term, derivatices 0 0
> > NGENIC: setting up modes in kspace...
> > NGENIC_2LPT: Computing secondary source term, derivatices 1 1
> > NGENIC: setting up modes in kspace...
> > NGENIC_2LPT: Computing secondary source term, derivatices 2 2
> > NGENIC: setting up modes in kspace...
> > NGENIC_2LPT: Computing secondary source term, derivatices 0 1
> > NGENIC: setting up modes in kspace...
> > NGENIC_2LPT: Computing secondary source term, derivatices 0 2
> > NGENIC: setting up modes in kspace...
> > NGENIC_2LPT: Computing secondary source term, derivatices 1 2
> > NGENIC: setting up modes in kspace...
> > NGENIC_2LPT: Secondary source term computed in real space
> > NGENIC_2LPT: Done transforming it to k-space
> > NGENIC_2LPT: Obtaining second order displacements for axes=0
> > NGENIC_2LPT: Obtaining second order displacements for axes=1
> > NGENIC_2LPT: Obtaining second order displacements for axes=2
> > NGENIC_2LPT: Obtaining Zeldovich displacements for axes=0
> > NGENIC: setting up modes in kspace...
> > NGENIC_2LPT: Obtaining Zeldovich displacements for axes=1
> > NGENIC: setting up modes in kspace...
> > NGENIC_2LPT: Obtaining Zeldovich displacements for axes=2
> > NGENIC: setting up modes in kspace...
> >
> > NGENIC: Maximum displacement: 375.266, in units of the part-spacing=
> 0.768545
> >
> >
> > NGENIC: Maximum velocity component: 2076.67
> >
> > INIT: Testing ID uniqueness...
> > INIT: success. took=1.45795 sec
> >
> > DOMAIN: Begin domain decomposition (sync-point 0).
> > DOMAIN: New shift vector determined (-165190 47404.2 171461)
> > DOMAIN: Sum=2 TotalCost=2 NumTimeBinsToBeBalanced=1 MultipleDomains=2
> > DOMAIN: Increasing TopNodeAllocFactor=0.08 new value=0.104
> > DOMAIN: Increasing TopNodeAllocFactor=0.104 new value=0.1352
> > DOMAIN: Increasing TopNodeAllocFactor=0.1352 new value=0.17576
> > DOMAIN: Increasing TopNodeAllocFactor=0.17576 new value=0.228488
> > DOMAIN: Increasing TopNodeAllocFactor=0.228488 new value=0.297034
> > DOMAIN: Increasing TopNodeAllocFactor=0.297034 new value=0.386145
> > DOMAIN: Increasing TopNodeAllocFactor=0.386145 new value=0.501988
> > DOMAIN: Increasing TopNodeAllocFactor=0.501988 new value=0.652585
> > DOMAIN: Increasing TopNodeAllocFactor=0.652585 new value=0.84836
> > DOMAIN: Increasing TopNodeAllocFactor=0.84836 new value=1.10287
> > DOMAIN: Increasing TopNodeAllocFactor=1.10287 new value=1.43373
> > DOMAIN: Increasing TopNodeAllocFactor=1.43373 new value=1.86385
> > DOMAIN: Increasing TopNodeAllocFactor=1.86385 new value=2.423
> > DOMAIN: Increasing TopNodeAllocFactor=2.423 new value=3.1499
> > DOMAIN: Increasing TopNodeAllocFactor=3.1499 new value=4.09487
> > DOMAIN: NTopleaves=4096, determination of top-level tree involved 4
> iterations and took 50.5168 sec
> > DOMAIN: we are going to try at most 474 different settings for combining
> the domains on tasks=216, nnodes=6
> > DOMAIN: total_cost=2 total_load=1
> > DOMAIN: best solution found after 1 iterations by task=75 for nextra=16,
> reaching maximum imbalance of 1.06271|1.06288
> > DOMAIN: combining multiple-domains took 0.588464 sec
> > DOMAIN: exchange of 1073741824 particles
> > DOMAIN: particle exchange done. (took 14.3663 sec)
> > DOMAIN: domain decomposition done. (took in total 67.5344 sec)
> > PEANO: Begin Peano-Hilbert order...
> > PEANO: done, took 5.81062 sec.
> >
> > SNAPSHOT: Setting next time for snapshot file to Time_next= 0.01
> (DumpFlag=1)
> >
> >
> >
> > Sync-Point 0, Time: 0.01, Redshift: 99, Systemstep: 0, Dloga: 0,
> Nsync-grv: 1073741824, Nsync-hyd: 0
> > DOMAIN: Begin domain decomposition (sync-point 0).
> > DOMAIN: New shift vector determined (-141172 229279 -198623)
> > DOMAIN: Sum=2 TotalCost=2 NumTimeBinsToBeBalanced=1 MultipleDomains=2
> > DOMAIN: NTopleaves=4096, determination of top-level tree involved 4
> iterations and took 6.65769 sec
> > DOMAIN: we are going to try at most 474 different settings for combining
> the domains on tasks=216, nnodes=6
> > DOMAIN: total_cost=2 total_load=1
> > DOMAIN: best solution found after 1 iterations by task=72 for nextra=20,
> reaching maximum imbalance of 1.06096|1.06104
> > DOMAIN: combining multiple-domains took 0.492839 sec
> > DOMAIN: exchange of 1073741824 particles
> > DOMAIN: particle exchange done. (took 12.0037 sec)
> > DOMAIN: domain decomposition done. (took in total 21.0146 sec)
> > PEANO: Begin Peano-Hilbert order...
> > PEANO: done, took 5.64457 sec.
> > ACCEL: Start tree gravity force computation... (1073741824 particles)
> > PM-PERIODIC: Starting periodic PM calculation. (Rcut=8789.06) presently
> allocated=1106.3 MB
> > PM-PERIODIC: done. (took 11.8741 seconds)
> > TIMESTEPS: displacement time constraint: 0.0926602 (0.05)
> > TREE: Full tree construction for all particles. (presently
> allocated=1637.45 MB)
> > GRAVTREE: Tree construction done. took 9.68924 sec <numnodes>=703477
> NTopnodes=4681 NTopleaves=4096 tree-build-scalability=0.993377
> > FMM: Begin tree force. timebin=0 (presently allocated=0.4 MB)
> > Code termination on task=208, function tree_fetch_foreign_nodes(), file
> > src/tree/tree.cc, line 1101: We are out of storage for foreign nodes:
> > NumForeignNodes=587074 MaxForeignNodes=587074 j=1 n_parts=0
> >
> --------------------------------------------------------------------------
> > MPI_ABORT was invoked on rank 208 in communicator MPI_COMM_WORLD
> > with errorcode 1.
> >
> > NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> > You may or may not see output from other processes, depending on
> > exactly when Open MPI kills them.
> >
> --------------------------------------------------------------------------
> >
> > Tiago Castro Post Doc, Department of Physics / UNITS / OATS
> > Phone: (+39 040 3199 120)
> > Mobile: (+39 388 794 1562)
> > Email: tiagobscastro_at_gmail.com
> > Website: tiagobscastro.com
> > Skype: tiagobscastro
> > Address: Osservatorio Astronomico di Trieste / Villa Bazzoni
> > Via Bazzoni, 2, 34143 Trieste TS
> >
> >
> >
>
>
>
>
> -----------------------------------------------------------
>
> If you wish to unsubscribe from this mailing, send mail to
> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
> A web-archive of this mailing list is available here:
> http://www.mpa-garching.mpg.de/gadget/gadget-list
>
Received on 2020-12-19 18:40:54

This archive was generated by hypermail 2.3.0 : 2023-01-10 10:01:32 CET