Re: Various MPI and memory allocation issues from Volker Springel on 2021-06-29 (GADGET General Discussion Mailing List)

From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Date: Tue, 29 Jun 2021 11:54:23 +0200

Hi Robin,

> On 25. Jun 2021, at 17:49, Robin Booth <robin.booth_at_sussex.ac.uk> wrote:
>
> Hi Volker
>
> I am carrying out various tests in advance of a fairly large (2^33 particles) Gadget4 run on the COSMA7 cluster at Durham, and I am encountering some issues relating to MPI operation and memory allocation. Currently I am running Gadget4 using the open_mpi/3.0.1 library.
>
> Issue 1
> The first issue is the occurrence of this error message, which I assume is generated by the MPI library rather than by the OS or by Gadget itself:
>
> --------------------------------------------------------------------------
> It appears as if there is not enough space for /tmp/ompi.m7007.21208/jf.44296/1/15/shared_window_5.m7007 (the shared-memory backing
> file). It is likely that your MPI job will now either abort or experience
> performance degradation.
>
> Local host: m7007.pri.cosma7.alces.network
> Space Requested: 78643266120 B
> Space Available: 67878432768 B
> --------------------------------------------------------------------------
> [m7007:106436] *** An error occurred in MPI_Win_allocate_shared
> [m7007:106436] *** reported by process [47573960753153,15]
> [m7007:106436] *** on communicator MPI COMMUNICATOR 3 SPLIT_TYPE FROM 0
> [m7007:106436] *** MPI_ERR_INTERN: internal error
> [m7007:106436] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [m7007:106436] *** and potentially your MPI job)
> [m7006.pri.cosma7.alces.network:111427] 7 more processes have sent help message help-opal-shmem-mmap.txt / target full
> [m7006.pri.cosma7.alces.network:111427] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [m7006.pri.cosma7.alces.network:111427] 7 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
>
> ----------------------------------------------------------------------------
> The issue only arises for certain compute nodes. As far as I understand it from the system administrator, older nodes on COSMA7 only have 68 Gbytes of storage allocated to the /tmp/ directory, whilst newer nodes have about 180 Gbytes. So if the batch system happens to allocate one or more of the older nodes, the run will fail with the above error. Apparently, there is nothing I or the administrator can do to change the /tmp size allocation. This leads to the following questions:
> • Why does MPI need to create a shared-memory backing file in the first place, and is there any way to disable this?
> • If it is a given that this file is required, is there any way of calculating in advance the value for 'Space Requested', which seems to depend on the number of compute nodes and cores assigned to the Gadget4 job in some not particularly transparent way?
>

Gadget4 allocates essentially all its memory as shared memory, such that other MPI ranks on the same node can access it directly via the shared-memory semantics allowed by MPI-3. On Linux, there are different ways in which such shared memory can be allocated. Using an actual backing file on a real filesystem (which is what happens for you) is a fairly old-fashioned approach that is no longer used by modern MPI software stacks; instead, the shared memory device /dev/shm is used.
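
To make the idea of a shared-memory segment more concrete, here is a minimal Python sketch. It does not use MPI at all and is not how Gadget4 or Open MPI implement this; it swaps in Python's standard multiprocessing.shared_memory module, which on Linux typically places its segments under /dev/shm. The function name is made up for this illustration, and the second handle simply stands in for another process on the same node.

```python
from multiprocessing import shared_memory


def roundtrip_via_shared_memory(payload: bytes) -> bytes:
    """Write payload into a shared segment and read it back via a second handle."""
    # Create a named shared-memory segment (on Linux usually visible in /dev/shm).
    owner = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        owner.buf[:len(payload)] = payload
        # Attach to the same segment by name, as another local process could do.
        peer = shared_memory.SharedMemory(name=owner.name)
        try:
            return bytes(peer.buf[:len(payload)])
        finally:
            peer.close()
    finally:
        owner.close()
        owner.unlink()  # remove the segment so nothing is left behind


if __name__ == "__main__":
    print(roundtrip_via_shared_memory(b"gadget"))
```

No copy through a regular file on disk is involved here; both handles map the same memory, which is the property the MPI-3 shared-memory windows provide between ranks on one node.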

The best way to avoid your problem is to avoid using a hopelessly outdated MPI library such as openmpi-3.0.1. Best to move to openmpi-4.X; then this problem should be gone.

The amount of memory requested is simply MaxMemSize times the number of MPI ranks you place on a node.
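
As a rough illustration of this rule, the following Python sketch multiplies MaxMemSize by the number of ranks per node and compares the result with the available space. The function names are made up for this example and are not part of Gadget-4. The numbers are taken from the failure message quoted under Issue 2 further down (14 ranks, MaxMemSize = 5000 MB, 64362.5 MB available), and any small overhead added by the MPI library itself is ignored.

```python
import math


def shared_request_mb(max_mem_size_mb: float, ranks_per_node: int) -> float:
    """Shared memory requested on one node: MaxMemSize times ranks on that node."""
    return max_mem_size_mb * ranks_per_node


def largest_max_mem_size_mb(available_mb: float, ranks_per_node: int) -> int:
    """Largest whole-MB MaxMemSize that still fits into the available space."""
    return math.floor(available_mb / ranks_per_node)


if __name__ == "__main__":
    ranks = 14            # MPI ranks on the node (from the quoted message)
    max_mem_size = 5000   # MaxMemSize in MB (from the quoted message)
    available = 64362.5   # shared memory available in MB (from the quoted message)

    request = shared_request_mb(max_mem_size, ranks)
    print(f"requested {request} MB, available {available} MB")
    print("fits" if request <= available else "does not fit")
    print("largest MaxMemSize that fits:",
          largest_max_mem_size_mb(available, ranks), "MB")
```

With these numbers the request is 70000 MB, which exceeds the 64362.5 MB available, so one would have to lower MaxMemSize, place fewer ranks per node, or get the limit raised.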

> Issue 2
> A separate, but possibly related question, concerns the amount of memory per compute node that is required for MPI operation. Is there any algorithm for calculating what this will be? For example, in one run I encounter this failure message:
>
> On node 'm6093.pri.cosma7.alces.network', we have 14 MPI ranks and at most 64362.5 MB of *shared* memory available. This is not enough space for MaxMemSize = 5000 MB
>
> With a total of >500 Gbytes of shared memory per compute node I fail to see why there is apparently only 64 Gbytes available for Gadget particle storage, etc. I can only assume that this is reserved for MPI communications, but if so, that seems a very large overhead per node.
>
> By the way, the memory allocation printout for that particular run is:
> -------------------------------------------------------------------------------------------------------------------------
> AvailMem: Largest = 127626.72 Mb (on task= 14), Smallest = 127580.92 Mb (on task= 0), Average = 127614.55 Mb
> Total Mem: Largest = 128743.13 Mb (on task= 98), Smallest = 128742.81 Mb (on task= 140), Average = 128743.10 Mb
> Committed_AS: Largest = 1162.21 Mb (on task= 0), Smallest = 1116.40 Mb (on task= 14), Average = 1128.55 Mb
> SwapTotal: Largest = 8096.00 Mb (on task= 0), Smallest = 8096.00 Mb (on task= 0), Average = 8096.00 Mb
> SwapFree: Largest = 8096.00 Mb (on task= 0), Smallest = 8096.00 Mb (on task= 0), Average = 8096.00 Mb
> AllocMem: Largest = 1162.21 Mb (on task= 0), Smallest = 1116.40 Mb (on task= 14), Average = 1128.55 Mb
> avail /dev/shm: Largest = 64368.42 Mb (on task= 14), Smallest = 64348.07 Mb (on task= 210), Average = 64359.22 Mb
> -------------------------------------------------------------------------------------------------------------------------
> Task=0 has the maximum commited memory and is host: m6093.pri.cosma7.alces.network
>

This is a problem that unfortunately pops up regularly: the shared memory available through /dev/shm is limited on many machines to 50% of the available physical memory due to an antique (and unnecessary) default setting on most Linux distributions. In this case, Gadget-4 can only use half of the memory on the nodes. See also a previous comment I made about this here:

https://wwwmpa.mpa-garching.mpg.de/gadget/gadget-list/0803.html

But there is really no deeper reason for this limit at all, and one can easily change the maximum size of /dev/shm to, for example, 95% of the physical memory. A simple
mount -o remount,size=95% /dev/shm
does the job, but this is of course only possible for administrators, and to make the change persistent between reboots, it has to be done in some system config files.
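
To see where a given node stands, one can compare the size of /dev/shm with the physical memory. The Python sketch below is only an illustration under stated assumptions: the helper names are hypothetical, it relies on the standard os.statvfs and os.sysconf calls as available on Linux, and it only reads values, so no administrator rights are needed.

```python
import os


def shm_fraction(shm_bytes: float, phys_bytes: float) -> float:
    """Fraction of physical memory that /dev/shm is allowed to use."""
    return shm_bytes / phys_bytes


def read_node_values(path: str = "/dev/shm"):
    """Return (total size of the tmpfs at path, physical memory), both in bytes."""
    st = os.statvfs(path)
    shm_total = st.f_frsize * st.f_blocks
    phys_total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return shm_total, phys_total


if __name__ == "__main__":
    if os.path.isdir("/dev/shm"):
        shm_total, phys_total = read_node_values()
        print(f"/dev/shm is sized at {shm_fraction(shm_total, phys_total):.0%} "
              "of physical memory")
    else:
        print("/dev/shm not found on this system")
```

Applied to the printout quoted above (avail /dev/shm about 64368 MB against Total Mem about 128743 MB), the ratio comes out very close to 0.5, consistent with the 50% default.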

The Max-Planck Computing and Data Facility (MPCDF) has made the above the default setting for all their supercomputers, and we at MPA have also adopted this for our local Freya cluster. It completely resolved the issue, without affecting the stability of the nodes at all. Likewise, the Leibniz Supercomputing Centre has adopted this upon my request for the big SuperMUC-NG cluster. Again, this worked flawlessly. And in fact, this has also been adopted for the new COSMA8 machine in Durham. So you could talk to your sys admin about this.

> Issue 3
> For the build options selected for this Gadget4 simulation, the printout gives the following memory sizes for the various data structures:
> BEGRUN: Size of particle structure 104 [bytes]
> BEGRUN: Size of sph particle structure 192 [bytes]
> BEGRUN: Size of gravity tree node 128 [bytes]
> BEGRUN: Size of neighbour tree node 152 [bytes]
> BEGRUN: Size of subfind auxiliary data 120 [bytes]
>
> Most of these make sense but I am unclear why the sph structure appears here as I am doing a DM-only simulation and none of the SPH options are enabled in the build.

The SPH structure is still present in a build for a DM-only simulation, and the above output is for information purposes only. No storage is allocated for SPH particles in this case.

Regards,
Volker

>
> Regards
>
> Robin
Received on 2021-06-29 11:54:23