Various MPI and memory allocation issues

From: Robin Booth <robin.booth_at_sussex.ac.uk>
Date: Fri, 25 Jun 2021 15:49:01 +0000

Hi Volker

I am carrying out various tests in advance of a fairly large (2^33 particles) Gadget4 run on the COSMA7 cluster at Durham, and I am encountering some issues relating to MPI operation and memory allocation. Currently I am running Gadget4 using the open_mpi/3.0.1 library.

Issue 1
The first issue is the occurrence of this error message, which I assume is generated by the MPI library rather than by the OS or by Gadget itself:

--------------------------------------------------------------------------
It appears as if there is not enough space for /tmp/ompi.m7007.21208/jf.44296/1/15/shared_window_5.m7007 (the shared-memory backing
file). It is likely that your MPI job will now either abort or experience
performance degradation.

  Local host: m7007.pri.cosma7.alces.network
  Space Requested: 78643266120 B
  Space Available: 67878432768 B
--------------------------------------------------------------------------
[m7007:106436] *** An error occurred in MPI_Win_allocate_shared
[m7007:106436] *** reported by process [47573960753153,15]
[m7007:106436] *** on communicator MPI COMMUNICATOR 3 SPLIT_TYPE FROM 0
[m7007:106436] *** MPI_ERR_INTERN: internal error
[m7007:106436] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[m7007:106436] *** and potentially your MPI job)
[m7006.pri.cosma7.alces.network:111427] 7 more processes have sent help message help-opal-shmem-mmap.txt / target full
[m7006.pri.cosma7.alces.network:111427] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[m7006.pri.cosma7.alces.network:111427] 7 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal

----------------------------------------------------------------------------
The issue only arises for certain compute nodes. As far as I understand it from the system administrator, older nodes on COSMA7 only have 68 Gbytes of storage allocated to the /tmp/ directory, whilst newer nodes have about 180 Gbytes. So if the batch system happens to allocate one or more of the older nodes, the run will fail with the above error. Apparently, there is nothing I or the administrator can do to change the /tmp size allocation. This leads to the following questions:

  * Why does MPI need to create a shared-memory backing file in the first place, and is there any way to disable this?
  * If it is a given that this file is required, is there any way of calculating in advance the value for 'Space Requested', which seems to depend on the number of compute nodes and cores assigned to the Gadget4 job in some not particularly transparent way?
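
To make the mismatch in the error message concrete, here is a minimal sketch of a pre-flight check that could be run on a node before the job starts. It compares a requested backing-file size with the free space in the temporary directory. The function names are hypothetical, and using Python's default temporary directory (normally /tmp, or whatever TMPDIR points to) is an assumption about where the MPI library will place its file on a given system.

```python
import shutil
import tempfile


def shortfall_bytes(requested_bytes, available_bytes):
    """Bytes by which the request exceeds the available space (0 if it fits)."""
    return max(0, requested_bytes - available_bytes)


def backing_file_fits(requested_bytes, path):
    """Return (fits, free_bytes) for the file system that holds `path`."""
    free_bytes = shutil.disk_usage(path).free
    return requested_bytes <= free_bytes, free_bytes


if __name__ == "__main__":
    # Values copied from the error message above.
    requested = 78643266120
    available = 67878432768
    print("shortfall in message:", shortfall_bytes(requested, available), "B")
    # -> shortfall in message: 10764833352 B

    # Assumption: the backing file goes into the default temporary directory.
    tmp_dir = tempfile.gettempdir()
    fits, free_now = backing_file_fits(requested, tmp_dir)
    print(f"{tmp_dir}: {free_now} B free, request fits: {fits}")
```

This only detects the problem in advance; it does not explain how 'Space Requested' is derived.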

Issue 2
A separate but possibly related question concerns the amount of memory per compute node that is required for MPI operation. Is there any algorithm for calculating what this will be? For example, in one run I encounter this failure message:

On node 'm6093.pri.cosma7.alces.network', we have 14 MPI ranks and at most 64362.5 MB of *shared* memory available. This is not enough space for MaxMemSize = 5000 MB

With a total of >500 Gbytes of shared memory per compute node, I fail to see why there is apparently only 64 Gbytes available for Gadget particle storage, etc. I can only assume that this is reserved for MPI communications, but if so, that seems a very large overhead per node.
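
The figures in the failure message are at least consistent with a simple per-node check. The sketch below assumes that the test is "ranks on the node times MaxMemSize must not exceed the free shared memory", which is what the wording of the message suggests but is not confirmed here. The function names are hypothetical.

```python
def shared_memory_shortfall_mb(ranks_per_node, max_mem_size_mb, shm_available_mb):
    """MB by which ranks * MaxMemSize exceeds the shared memory (negative if it fits)."""
    return ranks_per_node * max_mem_size_mb - shm_available_mb


def max_ranks_that_fit(max_mem_size_mb, shm_available_mb):
    """Largest number of ranks per node whose combined MaxMemSize fits."""
    return int(shm_available_mb // max_mem_size_mb)


def largest_max_mem_size_mb(ranks_per_node, shm_available_mb):
    """Largest MaxMemSize (MB) that fits for a given number of ranks per node."""
    return shm_available_mb / ranks_per_node


if __name__ == "__main__":
    # Values copied from the failure message above.
    ranks, max_mem_size, shm_available = 14, 5000, 64362.5

    print(shared_memory_shortfall_mb(ranks, max_mem_size, shm_available))  # 5637.5
    print(max_ranks_that_fit(max_mem_size, shm_available))                 # 12
    print(round(largest_max_mem_size_mb(ranks, shm_available), 2))         # 4597.32
```

Under that assumption, 14 ranks at 5000 MB each would need 70000 MB, which is 5637.5 MB more than the 64362.5 MB reported as available.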

By the way, the memory allocation printout for that particular run is:
-------------------------------------------------------------------------------------------------------------------------
AvailMem: Largest = 127626.72 Mb (on task= 14), Smallest = 127580.92 Mb (on task= 0), Average = 127614.55 Mb
Total Mem: Largest = 128743.13 Mb (on task= 98), Smallest = 128742.81 Mb (on task= 140), Average = 128743.10 Mb
Committed_AS: Largest = 1162.21 Mb (on task= 0), Smallest = 1116.40 Mb (on task= 14), Average = 1128.55 Mb
SwapTotal: Largest = 8096.00 Mb (on task= 0), Smallest = 8096.00 Mb (on task= 0), Average = 8096.00 Mb
SwapFree: Largest = 8096.00 Mb (on task= 0), Smallest = 8096.00 Mb (on task= 0), Average = 8096.00 Mb
AllocMem: Largest = 1162.21 Mb (on task= 0), Smallest = 1116.40 Mb (on task= 14), Average = 1128.55 Mb
avail /dev/shm: Largest = 64368.42 Mb (on task= 14), Smallest = 64348.07 Mb (on task= 210), Average = 64359.22 Mb
-------------------------------------------------------------------------------------------------------------------------
Task=0 has the maximum commited memory and is host: m6093.pri.cosma7.alces.network

Issue 3
For the build options selected for this Gadget4 simulation, the printout gives the following memory sizes for the various data structures:
BEGRUN: Size of particle structure 104 [bytes]
BEGRUN: Size of sph particle structure 192 [bytes]
BEGRUN: Size of gravity tree node 128 [bytes]
BEGRUN: Size of neighbour tree node 152 [bytes]
BEGRUN: Size of subfind auxiliary data 120 [bytes]

Most of these make sense, but I am unclear why the sph structure appears here, as I am doing a DM-only simulation and none of the SPH options are enabled in the build.

Regards

Robin
Received on 2021-06-25 17:49:21
