Re: trouble starting a large N-body run

From: Volker Springel <volker_at_MPA-Garching.MPG.DE>
Date: Mon, 12 May 2014 21:16:51 +0200

On May 9, 2014, at 12:22 AM, Robert Thompson wrote:

> Hi Volker, thanks for your reply (and apologies for my delayed response!).
>
>> hmmm, might be that either the OS version or the particular filesystem you use does not allow you to write files larger than 2GB, or - more likely - that you cannot write data-sets larger than 2 GB with one call of fwrite. (Are your restart files that big?)
> The IC generated is ~350 GB, and if spread over 512 processors I suspect the restart files would be well under 2 GB, but I could be mistaken. The code writes a few 2 MB restart files just before it crashes with the error.
>
>
>> In the latter case, you could modify the wrapper my_fwrite in io.c such that writes that are larger than 2 GB are split up into several smaller calls of fwrite().
> I added a quick print statement to the my_fwrite() call like so:
>
> size_t my_fwrite(void *ptr, size_t size, size_t nmemb, FILE * stream)
> {
>   size_t nwritten;
>
>   if(size * nmemb > 0)
>     {
>       if((nwritten = fwrite(ptr, size, nmemb, stream)) != nmemb)
>         {
>           printf("I/O error (fwrite) on task=%d has occurred: %s\n", ThisTask, strerror(errno));
>           printf("trying to write %zu*%zu=%zu\n", size, nmemb, size * nmemb);
>           fflush(stdout);
>           endrun(777);
>         }
>     }
>   else
>     nwritten = 0;
>
>   return nwritten;
> }
>
> the output for each task is:
>
> trying to write 18446744072716182480*1=18446744072716182480
>
> The number printed out for size seems unusually large. I've tried this with two different versions of Gadget-3 and it errors out in the same fashion. Any idea why this hasn't come up before for others running large N-body runs?
>
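
The suggestion quoted above, splitting one large write into several smaller fwrite() calls, could look roughly like the sketch below. It is not the Gadget implementation: the name split_fwrite and the max_chunk parameter are hypothetical, and the sketch assumes size * nmemb does not overflow size_t.

```c
#include <stdio.h>

/* Hypothetical helper, not part of Gadget: write size*nmemb bytes in
 * pieces of at most max_chunk bytes each.
 * Returns the number of complete items written, like fwrite(). */
size_t split_fwrite(const void *ptr, size_t size, size_t nmemb,
                    size_t max_chunk, FILE *stream)
{
  const char *p = (const char *) ptr;
  size_t total, done = 0;

  if(size == 0 || nmemb == 0 || max_chunk == 0)
    return 0;

  total = size * nmemb;         /* assumed not to overflow size_t */

  while(done < total)
    {
      size_t want = total - done;
      size_t got;

      if(want > max_chunk)
        want = max_chunk;

      /* write raw bytes so that a chunk may end in the middle of an item */
      got = fwrite(p + done, 1, want, stream);
      done += got;

      if(got != want)
        break;                  /* short write: the caller can inspect errno */
    }

  return done / size;
}
```

A caller would compare the return value with nmemb, exactly as my_fwrite does for fwrite(), with max_chunk set to something safely below 2 GB, for example (size_t) 1 << 30.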

Hi Robert,

This is indeed strange. Note that the number you got here is larger than 2^63... The largest unsigned 64-bit integer is 2^64-1 = 18446744073709551615, and your number is awfully close to this. This almost certainly has arisen from a (signed) 32-bit overflow that has been promoted to 64-bit. In the restart file code, this could for example happen if "NumPart" was a negative number. I don't quite see how this could have happened in the code, but this is the only obvious possibility I see right now - perhaps your ICs are still not quite correct.
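
To make the arithmetic concrete: the printed value equals 2^64 - 993369136, so read as a signed 64-bit number it is -993369136, which fits in a signed 32-bit int. The sketch below reproduces that reading. The helper names are made up for illustration, and the last function rests on the assumption that a 32-bit counter wrapped exactly once; under that assumption the intended byte count would have been 3301598160, roughly 3.3e9 bytes.

```c
#include <stdint.h>

/* Two's-complement reading of a 64-bit pattern, written portably. */
int64_t as_signed64(uint64_t v)
{
  if(v <= (uint64_t) INT64_MAX)
    return (int64_t) v;
  return -(int64_t) (UINT64_MAX - v) - 1;
}

/* What a negative 32-bit int turns into when it is widened to an
 * unsigned 64-bit type such as size_t (conversion is modulo 2^64). */
uint64_t sign_extend32(int32_t x)
{
  return (uint64_t) (int64_t) x;
}

/* ASSUMPTION: the 32-bit value wrapped exactly once, so the intended
 * count is the wrapped value plus 2^32. Only meaningful for inputs
 * in the range [-2^31, 0). */
uint64_t intended_from_wrapped32(int64_t wrapped)
{
  return (uint64_t) (wrapped + ((int64_t) 1 << 32));
}
```

Here as_signed64() applied to 18446744072716182480 gives -993369136, and sign_extend32(-993369136) gives back the printed value.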

Volker


> -Robert
>
>
>
>>
>> Volker
>>
>>
>>
>>
>>> -Robert
>>>
>>>
>>> On Mar 18, 2014, at 6:47 PM, Manodeep Sinha <manodeep.sinha_at_Vanderbilt.Edu> wrote:
>>>
>>>>
>>>> On 3/18/14 11:38 AM, Robert Thompson wrote:
>>>>> Hi Volker, thanks for your quick reply! I should note that the ICs were generated via N-GenIC and I am running the simulation with Gadget3.
>>>>>
>>>>>> It looks like your initial conditions file contains incorrect entries for the particle count. Note that 2250^3 > 2^32, i.e. your total particle count does not fit into an ordinary 32-bit unsigned int. In gadget2, the higher-order word is stored in a separate field in the file header (npartTotalHighWord[]).
>>>>>>
>>>>>> Check out the calculation of "All.TotNumPart" as well as that of "All.MaxPart" in read_ic.c. For some reason you are getting All.MaxPart = 0, likely due to an incorrectly computed value of All.TotNumPart, which in turn probably originates in a faulty IC file header.
>>>>> I had a sneaking suspicion of this. It seems neither N-GenIC nor 2LPTic contains npartTotalHighWord; apparently the values are stored in npartTotal[1] & npartTotal[2], which interestingly enough are 0 in my IC header (probably the source of the problem). In N-GenIC I commented out NO64BITID (and enabled LONGIDS in Gadget). Are there any other tricks to getting it to create such large ICs?
>>>>>
>>>>>
>>>>>> Note: 128000 cores is pretty over the top for this particle count. I doubt that Gadget2 (which is nearly 10 years old) will work well for such a large number of MPI ranks - never tried it myself.
>>>>> I felt that was far too many cores myself; I figured even if I did get it to run the MPI overhead would slow it to a crawl.
>>>>>
>>>>> -Robert
>>>>>
>>>>>
>>>> Hi Robert,
>>>>
>>>> I have run into a similar issue in the past -- the public version of 2LPTic stores the "overflow" part of the particle count in npartTotal[2]. Line 115 in save.c reads as:
>>>>
>>>> header.npartTotal[2] = (TotNumPart >> 32);
>>>>
>>>> You need to change this to:
>>>>
>>>> header.npartTotalHighWord[1] = (TotNumPart >> 32);
>>>>
>>>> You will also need to get the updated header definition from a working copy of Gadget2. Otherwise, the HighWord field is not defined.
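
A minimal sketch of the low-word/high-word split discussed above, using the 2250^3 particle count mentioned earlier in the thread. The helper names are hypothetical and the Gadget header struct is deliberately not reproduced here; only the field names npartTotal and npartTotalHighWord come from the thread.

```c
#include <stdint.h>

/* Lower 32 bits of the total: what would go into npartTotal[type]. */
uint32_t low_word(uint64_t total)
{
  return (uint32_t) (total & UINT64_C(0xFFFFFFFF));
}

/* Upper 32 bits of the total: what would go into npartTotalHighWord[type]. */
uint32_t high_word(uint64_t total)
{
  return (uint32_t) (total >> 32);
}

/* How a reader can reassemble the 64-bit total from the two words. */
uint64_t join_words(uint32_t low, uint32_t high)
{
  return ((uint64_t) high << 32) | low;
}
```

For 2250^3 = 11390625000 the low word is 2800690408 and the high word is 2. A reader that ignores the high word, or finds it in a different field than expected, reconstructs the wrong total.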
>>>>
>>>> In addition, since in your case Nmesh^3 exceeds UINT_MAX (2^32-1 on a typical 64-bit system), you will also need to modify main.c line 101 and declare nmesh3 as a double, and change the corresponding calculation for nmesh3 in line 558 and the (float) cast during the division by nmesh3 on line 625.
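
The effect of the nmesh3 overflow can be illustrated as follows. This is only a sketch: the function names are invented, and the grid size 2250 used in the note below is an assumed example (the thread gives it as the particle grid, not necessarily as Nmesh).

```c
#include <stdint.h>

/* Cube computed in 32-bit unsigned arithmetic: the result is reduced
 * modulo 2^32, so it silently wraps once n^3 exceeds 2^32 - 1. */
uint32_t cube_u32(uint32_t n)
{
  return n * n * n;
}

/* Cube computed in double precision: the first factor is converted
 * before multiplying, so no integer overflow can occur. */
double cube_double(uint32_t n)
{
  return (double) n * n * n;
}
```

With n = 2250, cube_u32() returns 2800690408 while cube_double() returns 11390625000, so any normalization that divides by the 32-bit value would be off by about a factor of four.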
>>>>
>>>> Presumably, the changes for N-GenIC will be at similar places - so I hope this helps.
>>>>
>>>> Cheers,
>>>> Manodeep
>>>>
>>>>>
>>>>> -----------------------------------------------------------
>>>>>
>>>>> If you wish to unsubscribe from this mailing, send mail to
>>>>> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
>>>>> A web-archive of this mailing list is available here:
>>>>> http://www.mpa-garching.mpg.de/gadget/gadget-list
>>>>>
Received on 2014-05-12 21:16:53

This archive was generated by hypermail 2.3.0 : 2023-01-10 10:01:32 CET