Re: trouble starting a large N-body run

From: Volker Springel <volker_at_MPA-Garching.MPG.DE>
Date: Tue, 18 Mar 2014 17:13:55 +0100

On Mar 18, 2014, at 4:26 PM, Robert Thompson wrote:

> Hi everyone,
> I am attempting a large N-body-only run with 2250^3 particles (IC file is ~257 GB). Launching the job on anything less than 128,000 processors results in an error when reading in and distributing the first IC file. It errors out with code 173 and says:
> Not enough space on task=36 (space for 0, need at least 90877)
> The last number varies depending on the number of cores I choose: a higher core count results in a lower number in place of 90877. Once I reach 128,000 cores I finally begin to run into memory issues related to the individual nodes (endrun 18):
> failed to allocate 62500 MB of memory. (presently allocated=4.39453 MB)
> I am currently trying this on BlueWaters, where each node has 32 processors and 64 GB of memory. I have tried various PMGRID options (256 to 2048), all with similar results. The particle count seems roughly equivalent to the Millennium simulation, which only used 512 processing cores according to the website. I have attempted to use fewer processors per node, but regardless of the configuration I end up with an endrun 173 unless I am launching with ~128,000 cores. I have the PartAllocFactor set to its minimum recommended value of 1.05 as well. Any advice on how to reduce the required core count?

Hi Robert,

It looks like your initial conditions file contains incorrect entries for the particle count. Note that 2250^3 > 2^32, i.e. your total particle count does not fit into an ordinary 32-bit unsigned int. In Gadget2, the high-order word of the count is stored in a separate field in the file header (npartTotalHighWord[]), so both fields have to be set correctly.
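To make the split concrete, here is a minimal sketch of how a 64-bit total can be divided into a low and a high 32-bit word and put back together. The function names are hypothetical helpers for illustration only, not part of Gadget2; only the idea of a separate high word (npartTotalHighWord[]) comes from the header layout mentioned above.

```c
#include <stdint.h>

/* Hypothetical helper: split a 64-bit particle total into the two
 * 32-bit words that a header with a separate high-word field needs. */
void split_particle_count(uint64_t total, uint32_t *low_word, uint32_t *high_word)
{
    *low_word  = (uint32_t)(total & 0xFFFFFFFFu); /* goes into the ordinary count field */
    *high_word = (uint32_t)(total >> 32);         /* goes into the high-word field */
}

/* Hypothetical helper: recombine the two words into the full total,
 * as any reader of such a header has to do. */
uint64_t combine_particle_count(uint32_t low_word, uint32_t high_word)
{
    return ((uint64_t)high_word << 32) | (uint64_t)low_word;
}
```

For 2250^3 = 11,390,625,000 this gives a high word of 2 and a low word of 2,800,690,408. If the high word is left at 0 in the file, a reader that recombines the two fields sees only 2,800,690,408 particles.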

Check out the calculation of "All.TotNumPart" and of "All.MaxPart" in read_ic.c. For some reason you are getting All.MaxPart = 0, likely because the computed value of All.TotNumPart is incorrect, which in turn probably originates in a faulty IC file header.
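As a rough guide to what to look for there, the sketch below shows the kind of arithmetic involved. It is written from scratch, not copied from read_ic.c: the struct, the function names, and the exact form of the per-task limit (PartAllocFactor times the average number of particles per task) are assumptions, so compare it against the actual source before relying on it.

```c
#include <stdint.h>

#define N_TYPES 6 /* assumed: one count per particle type in the header */

/* Hypothetical stand-in for the two header fields that matter here. */
struct header_counts {
    uint32_t npartTotal[N_TYPES];
    uint32_t npartTotalHighWord[N_TYPES];
};

/* Sum the 64-bit totals over all particle types. */
int64_t total_particles(const struct header_counts *h)
{
    int64_t total = 0;
    for (int i = 0; i < N_TYPES; i++) {
        total += (int64_t)h->npartTotal[i];
        total += ((int64_t)h->npartTotalHighWord[i]) << 32;
    }
    return total;
}

/* Assumed form of the per-task particle limit: allocation factor
 * times the average load per MPI task (integer division first). */
int64_t max_particles_per_task(int64_t total, int ntask, double part_alloc_factor)
{
    return (int64_t)(part_alloc_factor * (double)(total / ntask));
}
```

Printing the header fields and the resulting total right after the header is read should show quickly whether the file or the calculation is at fault: with this arithmetic, a total that comes out as zero yields a per-task limit of zero.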

Note: 128000 cores is pretty over the top for this particle count. I doubt that Gadget2 (which is nearly 10 years old) will work well for such a large number of MPI ranks - never tried it myself.


> -Robert Thompson
