Re: Restart file strategy

From: Robin Booth <robin.booth_at_sussex.ac.uk>
Date: Thu, 2 Sep 2021 22:28:05 +0000

Hi Volker, Leonard

Thanks for your respective comments.

Clearly my concern about restart files was not relevant to the current issue.

I attach an extract from the log file for one of the timesteps where SUBFIND is invoked.
In case it is relevant, the contents of parameters_used values are included below.
I can see nothing in the log to indicate where the excessive time is being incurred, but it may be more apparent to one of you.

If I were to recompile and run without the SUBFIND option, would the resulting executable be compatible with my existing restart files? If so, I might try this on my next run.

Regards

Robin

-------------------------------------------------------------------------


InitCondFile /cosma7/data/dp004/dc-boot5/Gadget4_snapshots/IC_files/Planck2013-Npart_2048_Box_3000-Fiducial
OutputDir /cosma6/data/dp004/dc-boot5/Gadget4_snapshots/Newbuild/
SnapshotFileBase snapshot
ICFormat 1
SnapFormat 3
NumFilesPerSnapshot 8
MaxFilesWithConcurrentIO 256
TimeLimitCPU 160000
CpuTimeBetRestartFile 160000
MaxMemSize 16000
TimeBegin 0.02
TimeMax 1
ComovingIntegrationOn 1
Omega0 0.307115
OmegaLambda 0.692885
OmegaBaryon 0.0482519
HubbleParam 0.6777
Hubble 100
BoxSize 3000
OutputListOn 1
OutputListFilename parameterfiles/output_list.txt
TimeBetSnapshot 0
TimeOfFirstSnapshot 0
TimeBetStatistics 0.1
ErrTolIntAccuracy 0.015
CourantFac 0.3
MaxSizeTimestep 0.02
MinSizeTimestep 0
TypeOfOpeningCriterion 1
ErrTolTheta 0.5
ErrTolThetaMax 1
ErrTolForceAcc 0.005
TopNodeFactor 3
ActivePartFracForNewDomainDecomp 0.01
ActivePartFracForPMinsteadOfEwald 0.05
UnitLength_in_cm 3.08568e+24
UnitMass_in_g 1.989e+43
UnitVelocity_in_cm_per_s 100000
GravityConstantInternal 0
SofteningComovingClass0 0.05
SofteningMaxPhysClass0 0.05
SofteningClassOfPartType0 0
SofteningClassOfPartType1 0
SofteningClassOfPartType2 0
SofteningClassOfPartType3 0
SofteningClassOfPartType4 0
SofteningClassOfPartType5 0
DesNumNgb 64
DesLinkNgb 15
MaxNumNgbDeviation 1
ArtBulkViscConst 1
MinEgySpec 0
InitGasTemp 0
LightConeDefinitionFile parameterfiles/lightcones.txt
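
For reference, TimeLimitCPU and CpuTimeBetRestartFile in the listing above are given in seconds. A minimal Python sketch that converts them to hours is shown here; the helper names parse_params and seconds_to_hours are hypothetical and not part of Gadget4, and the two values are copied from the listing.

```python
from typing import Dict


def parse_params(text: str) -> Dict[str, str]:
    """Parse 'Name value' lines into a dict (hypothetical helper).

    Blank lines and lines without a value are skipped.
    """
    params: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            params[parts[0]] = parts[1].strip()
    return params


def seconds_to_hours(seconds: float) -> float:
    """Convert a duration in seconds to hours."""
    return seconds / 3600.0


# The two restart-related entries, copied from the listing above.
listing = """
TimeLimitCPU 160000
CpuTimeBetRestartFile 160000
"""

params = parse_params(listing)
for name in ("TimeLimitCPU", "CpuTimeBetRestartFile"):
    hours = seconds_to_hours(float(params[name]))
    print(name, round(hours, 1), "hours")
```

Both values come out at about 44.4 hours, so with these settings the interval between restart files is the same as the CPU time limit of the run.
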
________________________________
From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Sent: 02 September 2021 09:46
To: Gadget General Discussion <gadget-list_at_MPA-Garching.MPG.DE>
Subject: Re: [gadget-list] Restart file strategy


Hi Robin,

I agree with Leonard that especially the time reported for step 2415 is not reasonable at all. It appears to be caused by the FOF group finder, which normally should be very fast... Something is not right here. This is not the only odd thing in your excerpt from balance.txt, however. I note in particular that timesteps occupied with only ~3000-4000 particles take very long - longer in fact than the bigger timesteps! This is pretty strange also. Not sure what the setup of your run is, but this is highly unusual, and especially step 2415 suggests that something isn't quite right.

I'm not fully sure why you suspected the writing of the restart files is to blame. This doesn't cause extra computations, i.e. there is no global gravity calculation or tree walk induced by this. The only cost it has is I/O time, i.e. it depends primarily on the speed of the filesystem whether too high a frequency of "safety" restart-file writing is problematic or not. Normally it isn't if this is done only every > 6 hours or so. You can check for the time this takes by grepping for "RESTART:" in the stdout log-file. Yes, one can in principle completely dispense with regular intermediate restart files (whose only purpose is to resume the simulation from there in case one has a system or code crash - disabling this means one has to start from the beginning if this happens). When the code ends due to reaching TimeLimitCPU, a restart set is written in any case.
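
A minimal Python sketch of that check, equivalent to a plain grep for "RESTART:", could look like the following. The function names and the idea of passing the log path as an argument are assumptions made for illustration; only the "RESTART:" tag comes from the message above, and nothing is assumed about what else those log lines contain.

```python
from typing import Iterable, List


def restart_lines(log_lines: Iterable[str], tag: str = "RESTART:") -> List[str]:
    """Return every line containing the tag, without trailing newlines."""
    return [line.rstrip("\n") for line in log_lines if tag in line]


def restart_lines_from_file(path: str) -> List[str]:
    """Read a stdout log file and return its lines containing 'RESTART:'.

    The path is whatever file the job's stdout was redirected to;
    undecodable bytes are replaced rather than raising an error.
    """
    with open(path, errors="replace") as fh:
        return restart_lines(fh)
```

Calling restart_lines_from_file on the stdout log and printing the result shows the restart-related messages in the order they were written, which is where any time spent on restart-file I/O should be visible.
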

Best,
Volker


> On 1. Sep 2021, at 13:48, Leonard Romano <leonard.romano_at_tum.de> wrote:
>
> Hi Robin,
>
> From looking at the balance.txt you have provided, I can immediately see that during the long step, there are a lot of '6' characters which correspond to either FOF or SUBFIND (look at logs.h for details). Thus I recommend trying to turn them off and see if this solves your problem.
> The problem you are facing might be related to the force accuracy requirements you have set, which causes subfind to run for an unnecessarily long time, so lowering them might also help.
>
> Best,
> Leonard
>
> On 01.09.21 13:36, Robin Booth wrote:
>> Hi Volker
>>
>> I noticed that my Gadget4 run appeared to "stall" for several hours periodically during a run, with these instances roughly correlating to requested restart file outputs.
>> On further investigation, from inspection of the balance.txt file for example (see extract attached), it would appear that this is caused by the code performing a Nsync-grv on the entire particle set, including an expensive fof tree walk. This pushes up the CPU step time from typically 4 minutes to around 8 hours! Does that seem reasonable to you? I assume that this process is necessary for the restart files to record the particle parameters for all particles at the same timestep.
>> If this is indeed the correct restart file behaviour, then my conclusion would be that it is extremely counter-productive to request restart file output too frequently during a simulation run, particularly where I am running the simulation in time-limited "chunks". My understanding from the documentation is that a restart file will in any case be generated automatically when the run time approaches the limit set by the TimeLimitCPU parameter. Would you agree with this conclusion?
>>
>> Regards
>>
>> Robin
>>
> --
> ===================================================
> Leonard Romano, B.Sc.(レオナルド・ロマノ)
> Physics Department
> Technical University of Munich (TUM), Germany
> Theoretical Astrophysics Group
> Department of Earth and Space Science
> Graduate School of Science, Osaka University, Japan
> he / him / his
> ===================================================
>
>
> -----------------------------------------------------------
>
> If you wish to unsubscribe from this mailing, send mail to
> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
> A web-archive of this mailing list is available here:
> http://www.mpa-garching.mpg.de/gadget/gadget-list




Received on 2021-09-03 00:28:21

This archive was generated by hypermail 2.3.0 : 2022-09-01 14:03:43 CEST