Re: Restart file strategy

From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Date: Thu, 2 Sep 2021 10:46:39 +0200

Hi Robin,

I agree with Leonard that the time reported for step 2415 in particular is not reasonable at all. It appears to be caused by the FOF group finder, which should normally be very fast... Something is not right here. This is not the only odd thing in your excerpt from balance.txt, however. I note in particular that timesteps occupied by only ~3000-4000 particles take very long, longer in fact than the bigger timesteps! This is pretty strange as well. I'm not sure what the setup of your run is, but this is highly unusual, and step 2415 especially suggests that something isn't quite right.
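
The comparison described above (small timesteps taking longer than the biggest ones) can be sketched roughly as follows. This is only an illustration: it assumes the step number, the number of active particles and the elapsed time per step have already been extracted by hand or by some other script, and the function name and the tuple layout are hypothetical, not part of GADGET-4.

```python
def out_of_proportion(steps):
    """Flag steps that took longer than the most heavily occupied step.

    steps: iterable of (step_number, active_particles, seconds) tuples.
    This layout is an assumption made for the sketch; it is not the
    format of balance.txt or of any other GADGET-4 log file.
    """
    steps = list(steps)
    if not steps:
        return []
    # Use the step with the most active particles as the reference.
    _, ref_particles, ref_seconds = max(steps, key=lambda s: s[1])
    # A step with fewer particles than the reference should not take longer.
    return [
        number
        for number, particles, seconds in steps
        if particles < ref_particles and seconds > ref_seconds
    ]
```

Any step number this returns had fewer active particles than the largest step yet took longer, which is the pattern that looks suspicious here.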

I'm not quite sure why you suspected that the writing of the restart files is to blame. It doesn't cause extra computations, i.e. no global gravity calculation or tree walk is induced by it. The only cost is I/O time, i.e. whether writing "safety" restart files too frequently is problematic depends primarily on the speed of the filesystem. Normally it isn't if this is done only at intervals of more than 6 hours or so. You can check how long this takes by grepping for "RESTART:" in the stdout log-file. And yes, one can in principle completely dispense with regular intermediate restart files. Their only purpose is to let one resume the simulation from that point after a system or code crash, so disabling them means one has to start from the beginning if this happens. When the code ends due to reaching TimeLimitCPU, a restart set is written in any case.
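
For those who prefer a script to grep, a minimal sketch of the same check is given below. It only collects the lines containing "RESTART:" and makes no assumption about what else those lines contain; the function name is hypothetical and the log path has to be supplied by the user.

```python
import sys


def restart_lines(log_path, marker="RESTART:"):
    """Return (line_number, text) for every log line containing marker."""
    hits = []
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(log_path, "r", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if marker in line:
                hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # The first command-line argument is the path to the stdout log-file.
    for number, text in restart_lines(sys.argv[1]):
        print(f"{number}: {text}")
```

Run it with the stdout log-file as its only argument; the matching lines are printed together with their line numbers, so the surrounding log output is easy to find.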

Best,
Volker


> On 1. Sep 2021, at 13:48, Leonard Romano <leonard.romano_at_tum.de> wrote:
>
> Hi Robin,
>
> From looking at the balance.txt you have provided, I can immediately see that during the long step, there are a lot of '6' characters which correspond to either FOF or SUBFIND (look at logs.h for details). Thus I recommend trying to turn them off and see if this solves your problem.
> The problem you are facing might be related to the force accuracy requirements you have set, which may cause subfind to run for an unnecessarily long time, so lowering them might also help.
>
> Best,
> Leonard
>
> On 01.09.21 13:36, Robin Booth wrote:
>> Hi Volker
>>
>> I noticed that my Gadget4 run appeared to "stall" for several hours periodically during a run, with these instances roughly correlating to requested restart file outputs.
>> On further investigation, from inspection of the balance.txt file for example (see extract attached), it would appear that this is caused by the code performing a Nsync-grv on the entire particle set, including an expensive fof tree walk. This pushes up the CPU step time from typically 4 minutes to around 8 hours! Does that seem reasonable to you? I assume that this process is necessary for the restart files to record the particle parameters for all particles at the same timestep.
>> If this is indeed the correct restart file behaviour, then my conclusion would be that it is extremely counter-productive to request restart file output too frequently during a simulation run, particularly where I am running the simulation in time-limited "chunks". My understanding from the documentation is that a restart file will in any case be generated automatically when the run time approaches the limit set by the TimeLimitCPU parameter. Would you agree with this conclusion?
>>
>> Regards
>>
>> Robin
>>
> --
> ===================================================
> Leonard Romano, B.Sc.(レオナルド・ロマノ)
> Physics Department
> Technical University of Munich (TUM), Germany
> Theoretical Astrophysics Group
> Department of Earth and Space Science
> Graduate School of Science, Osaka University, Japan
> he / him / his
> ===================================================
Received on 2021-09-02 10:46:57
