Re: trouble starting a large N-body run

From: Robert Thompson <rthompsonj_at_gmail.com>
Date: Fri, 9 May 2014 00:22:06 +0200

Hi Volker, thanks for your reply (and apologies for my delayed response!).

> hmmm, might be that either the OS version or the particular filesystem you use does not allow you to write files larger than 2GB, or - more likely - that you cannot write data-sets larger than 2 GB with one call of fwrite. (Are your restart files that big?)
The generated IC is ~350 GB; spread over 512 processors, I suspect the restart files would be well under 2 GB each, but I could be mistaken. The code writes a few 2 MB restart files just before it crashes with the error.


> In the latter case, you could modify the wrapper my_fwrite in io.c such that writes larger than 2 GB are split up into several smaller calls of fwrite().
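For reference, a minimal sketch of the kind of splitting suggested above. The names chunked_fwrite and MAX_CHUNK_BYTES are hypothetical and not part of Gadget, and the chunk size is passed as a parameter only so the splitting can be exercised on small buffers. It is an illustration under those assumptions, not the actual io.c change:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical default chunk limit: 1 GiB, comfortably below 2 GB. */
#define MAX_CHUNK_BYTES ((size_t) 1 << 30)

/* Write size*nmemb bytes in pieces of at most max_chunk bytes.
 * Returns the number of complete elements written, mirroring fwrite().
 * Passing max_chunk == 0 selects MAX_CHUNK_BYTES. */
size_t chunked_fwrite(const void *ptr, size_t size, size_t nmemb,
                      FILE *stream, size_t max_chunk)
{
  const char *p = (const char *) ptr;
  size_t total, done = 0;

  if(size == 0 || nmemb == 0)
    return 0;

  if(nmemb > SIZE_MAX / size)   /* size*nmemb would overflow size_t */
    return 0;

  if(max_chunk == 0)
    max_chunk = MAX_CHUNK_BYTES;

  total = size * nmemb;

  while(done < total)
    {
      size_t want = total - done;
      size_t got;

      if(want > max_chunk)
        want = max_chunk;

      got = fwrite(p + done, 1, want, stream);
      done += got;

      if(got != want)           /* short write: caller should check errno */
        break;
    }

  return done / size;
}
```

A wrapper like my_fwrite could call this in place of fwrite() and keep its existing error check on the return value.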
I added a quick print statement to the my_fwrite() call like so:

size_t my_fwrite(void *ptr, size_t size, size_t nmemb, FILE * stream)
{
  size_t nwritten;

  if(size * nmemb > 0)
    {
      if((nwritten = fwrite(ptr, size, nmemb, stream)) != nmemb)
        {
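          printf("I/O error (fwrite) on task=%d has occurred: %s\n", ThisTask, strerror(errno));
          /* size_t arguments are printed with %zu; %lu is only correct where size_t is unsigned long */
          printf("trying to write %zu*%zu=%zu\n", size, nmemb, size * nmemb);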
          fflush(stdout);
          endrun(777);
        }
    }
  else
    nwritten = 0;

  return nwritten;
}

The output for each task is:

trying to write 18446744072716182480*1=18446744072716182480

The number printed for size seems unusually large. I've tried this with two different versions of Gadget-3, and both fail in the same way. Any idea why this hasn't come up before for others doing large N-body runs?
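One possible reading of that number, offered as an assumption rather than a diagnosis: 18446744072716182480 equals 2^64 - 993369136, which is what a negative 32-bit int looks like once it is converted to a 64-bit size_t. The same 32 bits read as unsigned give 3301598160 bytes. The sketch below only reproduces that arithmetic; the helper names are hypothetical and nothing here is taken from the Gadget source:

```c
#include <stdint.h>

/* Convert a signed 32-bit byte count to a 64-bit unsigned value, as happens
 * when an int is passed where a 64-bit size_t is expected. The value is
 * sign-extended and then reduced modulo 2^64. */
uint64_t widen_to_size(int32_t nbytes)
{
  return (uint64_t) (int64_t) nbytes;
}

/* The same 32 bits interpreted as unsigned (conversion is modulo 2^32),
 * i.e. the byte count that may originally have been intended. */
uint32_t as_unsigned32(int32_t nbytes)
{
  return (uint32_t) nbytes;
}
```

Here widen_to_size(-993369136) gives 18446744072716182480, matching the log line, and as_unsigned32(-993369136) gives 3301598160, about 3.07 GiB. If that reading is right, the byte count overflowed a signed 32-bit variable somewhere before reaching my_fwrite().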

-Robert



>
> Volker
>
>
>
>
>> -Robert
>>
>>
>> On Mar 18, 2014, at 6:47 PM, Manodeep Sinha <manodeep.sinha_at_Vanderbilt.Edu> wrote:
>>
>>>
>>> On 3/18/14 11:38 AM, Robert Thompson wrote:
>>>> Hi Volker, thanks for your quick reply! I should note that the ICs were generated via N-GenIC and I am running the simulation with Gadget3.
>>>>
>>>>> It looks like your initial conditions file contains incorrect entries for the particle count. Note that 2250^3 > 2^32, i.e. your total particle count does not fit into an ordinary 32-bit unsigned int. In gadget2, the higher-order word is stored in a separate field in the file header (npartTotalHighWord[]).
>>>>>
>>>>> Check the calculation of "All.TotNumPart" as well as that of "All.MaxPart" in read_ic.c. For some reason you are getting All.MaxPart = 0, likely due to an incorrect computed value of All.TotNumPart, which in turn probably originates in a faulty IC file header.
>>>> I had a sneaking suspicion of this. It seems neither N-GenIC nor 2LPTic contains npartTotalHighWord; apparently the values are stored in npartTotal[1] and npartTotal[2], which interestingly enough are 0 in my IC header (probably the source of the problem). In N-GenIC I commented out NO64BITID (and enabled LONGIDS in Gadget). Are there any other tricks to getting it to create such large ICs?
>>>>
>>>>
>>>>> Note: 128000 cores is pretty over the top for this particle count. I doubt that Gadget2 (which is nearly 10 years old) will work well for such a large number of MPI ranks - never tried it myself.
>>>> I felt that was far too many cores myself; I figured even if I did get it to run the MPI overhead would slow it to a crawl.
>>>>
>>>> -Robert
>>>>
>>>>
>>> Hi Robert,
>>>
>>> I have run into a similar issue in the past -- the public version of 2LPTic assigns the "overflow" particles to npartTotal[2]. Line 115 in save.c reads:
>>>
>>> header.npartTotal[2] = (TotNumPart >> 32);
>>>
>>> You need to change this to:
>>>
>>> header.npartTotalHighWord[1] = (TotNumPart >> 32);
>>>
>>> You will also need to get the updated header definition from a working copy of Gadget2. Otherwise, the HighWord field is not defined.
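To illustrate the low-word/high-word split discussed above, here is a minimal sketch. The function names are hypothetical and are not part of 2LPTic, N-GenIC, or Gadget. Following the message above, the low word would go into npartTotal[1] and the high word into npartTotalHighWord[1]:

```c
#include <stdint.h>

/* Split a 64-bit total particle count into two 32-bit header words. */
void split_particle_count(uint64_t total, uint32_t *low_word, uint32_t *high_word)
{
  *low_word  = (uint32_t) (total & 0xFFFFFFFFu);  /* lower 32 bits */
  *high_word = (uint32_t) (total >> 32);          /* upper 32 bits */
}

/* Reassemble the 64-bit count from the two header words, as a reader
 * of the header would have to do. */
uint64_t join_particle_count(uint32_t low_word, uint32_t high_word)
{
  return ((uint64_t) high_word << 32) | low_word;
}
```

For 2250^3 = 11390625000 particles this gives a high word of 2 and a low word of 2800690408, so a header that leaves the high word at 0 under-reports the count by 2 * 2^32.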
>>>
>>> In addition, since in your case Nmesh^3 exceeds UINT_MAX (2^32-1 on typical 64-bit systems, where unsigned int is 32 bits), you will also need to modify main.c line 101 to declare nmesh3 as a double, and change the corresponding calculation of nmesh3 on line 558 and the (float) cast in the division by nmesh3 on line 625.
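A small sketch of why the type of nmesh3 matters. The actual Nmesh is not stated in this thread, so 2250 (taken from the particle count quoted earlier) is used purely as an illustrative value, and the function names are hypothetical:

```c
#include <stdint.h>

/* Nmesh^3 in 32-bit unsigned arithmetic: the product wraps modulo 2^32
 * (assuming unsigned int is 32 bits, as on common platforms). */
uint32_t nmesh3_u32(uint32_t nmesh)
{
  return (uint32_t) (nmesh * nmesh * nmesh);
}

/* Nmesh^3 as a double, as suggested above. A double represents every
 * integer up to 2^53 exactly, so the cube is exact for any realistic mesh. */
double nmesh3_double(uint32_t nmesh)
{
  double n = (double) nmesh;
  return n * n * n;
}
```

With nmesh = 2250 the 32-bit version returns 2800690408 instead of 11390625000, while the double version returns the exact value.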
>>>
>>> Presumably, the changes for N-GenIC will be in similar places, so I hope this helps.
>>>
>>> Cheers,
>>> Manodeep
>>>
>>>>
>>>> -----------------------------------------------------------
>>>>
>>>> If you wish to unsubscribe from this mailing, send mail to
>>>> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
>>>> A web-archive of this mailing list is available here:
>>>> http://www.mpa-garching.mpg.de/gadget/gadget-list
>>>>
>>>
>>>
>>>
>>>
>>> -----------------------------------------------------------
>>> If you wish to unsubscribe from this mailing, send mail to
>>> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
>>> A web-archive of this mailing list is available here:
>>> http://www.mpa-garching.mpg.de/gadget/gadget-list
>>
>>
>> -----------------------------------------------------------
>>
>> If you wish to unsubscribe from this mailing, send mail to
>> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
>> A web-archive of this mailing list is available here:
>> http://www.mpa-garching.mpg.de/gadget/gadget-list
>
>
>
>
> -----------------------------------------------------------
>
> If you wish to unsubscribe from this mailing, send mail to
> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
> A web-archive of this mailing list is available here:
> http://www.mpa-garching.mpg.de/gadget/gadget-list
Received on 2014-05-09 00:22:11

This archive was generated by hypermail 2.3.0 : 2023-01-10 10:01:32 CET