Re: Restart file strategy

From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Date: Sat, 18 Sep 2021 10:05:17 +0200

Dear Robin,

Thanks for this additional information. The excessive time comes from the gravitational unbinding in SUBFIND. This is basically this line:
SUBFIND: subfind_hbt_single_group() processing for Ngroups=28848 took 39163.5 sec
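
To put that figure in perspective, here is a small Python sketch (not part of GADGET-4; the regular expression and the function name are assumptions based only on the single log line quoted above) that extracts the group count and the elapsed time and converts them to hours and to an average per group:

```python
import re

# The log line quoted above. The pattern is an assumption derived from this
# one line and may not match other SUBFIND log messages.
LOG_LINE = ("SUBFIND: subfind_hbt_single_group() processing for "
            "Ngroups=28848 took 39163.5 sec")

PATTERN = re.compile(r"Ngroups=(\d+) took ([0-9.]+) sec")


def summarize_unbinding(line):
    """Return (ngroups, seconds, hours, seconds_per_group), or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    ngroups = int(match.group(1))
    seconds = float(match.group(2))
    return ngroups, seconds, seconds / 3600.0, seconds / ngroups


summary = summarize_unbinding(LOG_LINE)
if summary is not None:
    ngroups, seconds, hours, per_group = summary
    print(f"{ngroups} groups, {hours:.1f} h in total, {per_group:.2f} s per group")
```

For the line above this works out to roughly 10.9 hours in total, or about 1.4 seconds per group on average.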

I think this is because of the parameter setting ErrTolTheta=0.5 you used, which gives an excessively small opening angle for the BH opening criterion. The latter is always used for the gravitational potential calculation in the unbinding procedure of SUBFIND. It will be perfectly fine in terms of accuracy to use a considerably larger value here, say 0.9 or 0.95. The computation time for the gravitational tree walk is very sensitive to ErrTolTheta, so I think this is likely the primary cause of the slowness of subfind you've experienced when processing the lightcone particles (in addition, the lightcone contained many halos, of course).
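
As a rough illustration of why the tree walk cost reacts so strongly to the opening angle, the following Python sketch builds a toy 2D Barnes-Hut tree and counts how many node or particle interactions one target position needs for two values of theta. This is a textbook geometric criterion (accept a node if size/distance < theta), not GADGET-4's actual implementation; the class `Node`, the helpers `build_demo_tree` and `count_interactions`, and the random point set are all assumptions made for the example.

```python
import math
import random


class Node:
    """Square cell of a toy 2D Barnes-Hut tree (illustrative only)."""

    def __init__(self, cx, cy, size, points):
        self.size = size
        self.n = len(points)
        self.children = []
        # Centre of mass for equal-mass points.
        self.com = (sum(p[0] for p in points) / self.n,
                    sum(p[1] for p in points) / self.n)
        # Subdivide until each cell holds one point (with a size floor
        # so that coincident points cannot cause endless recursion).
        if self.n > 1 and size > 1e-9:
            for dx in (-1, 1):
                for dy in (-1, 1):
                    sub = [p for p in points
                           if (p[0] >= cx) == (dx > 0)
                           and (p[1] >= cy) == (dy > 0)]
                    if sub:
                        self.children.append(
                            Node(cx + dx * size / 4, cy + dy * size / 4,
                                 size / 2, sub))


def build_demo_tree(n, seed=1):
    """Build a tree over n random points in the unit square (made-up data)."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    return Node(0.5, 0.5, 1.0, points)


def count_interactions(node, x, y, theta):
    """Count the nodes and particles used to evaluate the potential at (x, y)."""
    if not node.children:
        return 1  # leaf cell: direct interaction
    r = math.hypot(node.com[0] - x, node.com[1] - y)
    if r > 0 and node.size / r < theta:
        return 1  # node is small or distant enough: use it as a whole
    return sum(count_interactions(c, x, y, theta) for c in node.children)


root = build_demo_tree(2000)
for theta in (0.5, 0.9):
    print(theta, count_interactions(root, 0.5, 0.5, theta))
```

With this criterion a larger theta can only accept nodes earlier in the walk, so the interaction count never increases as theta grows. How large the saving is in a real SUBFIND run depends on the particle distribution and on the code's own criterion.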

Bests,
Volker

> On 3. Sep 2021, at 00:28, Robin Booth <robin.booth_at_sussex.ac.uk> wrote:
>
> Hi Volker, Leonard
>
> Thanks for your respective comments.
>
> Clearly my concern about restart files was not relevant to the current issue.
>
> I attach an extract from the log file for one of the timesteps where SUBFIND is invoked.
> In case it is relevant, the contents of parameters_used values are included below.
> I can see nothing in the log to indicate where the excessive time is being incurred, but it may be more apparent to one of you.
>
> If I were to recompile and run without the SUBFIND option, would the resulting executable be compatible with my existing restart files? If so, I might try this on my next run.
>
> Regards
>
> Robin
>
> -------------------------------------------------------------------------
>
>
> InitCondFile /cosma7/data/dp004/dc-boot5/Gadget4_snapshots/IC_files/Planck2013-Npart_2048_Box_3000-Fiducial
> OutputDir /cosma6/data/dp004/dc-boot5/Gadget4_snapshots/Newbuild/
> SnapshotFileBase snapshot
> ICFormat 1
> SnapFormat 3
> NumFilesPerSnapshot 8
> MaxFilesWithConcurrentIO 256
> TimeLimitCPU 160000
> CpuTimeBetRestartFile 160000
> MaxMemSize 16000
> TimeBegin 0.02
> TimeMax 1
> ComovingIntegrationOn 1
> Omega0 0.307115
> OmegaLambda 0.692885
> OmegaBaryon 0.0482519
> HubbleParam 0.6777
> Hubble 100
> BoxSize 3000
> OutputListOn 1
> OutputListFilename parameterfiles/output_list.txt
> TimeBetSnapshot 0
> TimeOfFirstSnapshot 0
> TimeBetStatistics 0.1
> ErrTolIntAccuracy 0.015
> CourantFac 0.3
> MaxSizeTimestep 0.02
> MinSizeTimestep 0
> TypeOfOpeningCriterion 1
> ErrTolTheta 0.5
> ErrTolThetaMax 1
> ErrTolForceAcc 0.005
> TopNodeFactor 3
> ActivePartFracForNewDomainDecomp 0.01
> ActivePartFracForPMinsteadOfEwald 0.05
> UnitLength_in_cm 3.08568e+24
> UnitMass_in_g 1.989e+43
> UnitVelocity_in_cm_per_s 100000
> GravityConstantInternal 0
> SofteningComovingClass0 0.05
> SofteningMaxPhysClass0 0.05
> SofteningClassOfPartType0 0
> SofteningClassOfPartType1 0
> SofteningClassOfPartType2 0
> SofteningClassOfPartType3 0
> SofteningClassOfPartType4 0
> SofteningClassOfPartType5 0
> DesNumNgb 64
> DesLinkNgb 15
> MaxNumNgbDeviation 1
> ArtBulkViscConst 1
> MinEgySpec 0
> InitGasTemp 0
> LightConeDefinitionFile parameterfiles/lightcones.txt
>
> From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
> Sent: 02 September 2021 09:46
> To: Gadget General Discussion <gadget-list_at_MPA-Garching.MPG.DE>
> Subject: Re: [gadget-list] Restart file strategy
>
>
> Hi Robin,
>
> I agree with Leonard that especially the time reported for step 2415 is not reasonable at all. It appears to be caused by the FOF group finder, which normally should be very fast... Something is not right here. This is not the only odd thing in your excerpt from balance.txt, however. I note in particular that timesteps occupied with only ~3000-4000 particles take very long, longer in fact than the bigger timesteps! This is pretty strange as well. I am not sure what the setup of your run is, but this is highly unusual, and especially step 2415 suggests that something isn't quite right.
>
> I'm not fully sure why you suspected that the writing of the restart files is to blame. This doesn't cause extra computations, i.e. there is no global gravity calculation or tree walk induced by this. The only cost it has is I/O time, i.e. it depends primarily on the speed of the filesystem whether too high a frequency of "safety" restart-file writing is problematic or not. Normally it isn't if this is done only every > 6 hours or so. You can check the time this takes by grepping for "RESTART:" in the stdout log-file. Yes, one can in principle completely dispense with regular intermediate restart files (whose only purpose is to resume the simulation from there in case one has a system or code crash; disabling this means one has to start from the beginning if this happens). When the code ends due to reaching TimeLimitCPU, a restart set is written in any case.
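
The "RESTART:" check mentioned above can be sketched in Python as a plain substring filter. The function name `find_restart_lines`, the file name `stdout.log` and the sample lines are hypothetical placeholders, not real GADGET-4 output; only the "RESTART:" tag comes from the message above.

```python
def find_restart_lines(lines, tag="RESTART:"):
    """Return every line that contains the tag, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if tag in line]


# Made-up placeholder lines, used only to show the filtering.
SAMPLE_LINES = [
    "some other log output\n",
    "RESTART: <placeholder message 1>\n",
    "more log output\n",
    "RESTART: <placeholder message 2>\n",
]

for hit in find_restart_lines(SAMPLE_LINES):
    print(hit)

# For a real log file (hypothetical file name):
# with open("stdout.log") as log:
#     for hit in find_restart_lines(log):
#         print(hit)
```

Iterating over the open file keeps memory use low even for very large logs, since only the matching lines are kept.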
>
> Best,
> Volker
>
>
> > On 1. Sep 2021, at 13:48, Leonard Romano <leonard.romano_at_tum.de> wrote:
> >
> > Hi Robin,
> >
> > From looking at the balance.txt you have provided, I can immediately see that during the long step, there are a lot of '6' characters which correspond to either FOF or SUBFIND (look at logs.h for details). Thus I recommend trying to turn them off and see if this solves your problem.
> > The problem you are facing might be related to the force accuracy requirements you have set, which cause SUBFIND to run for an unnecessarily long time, so relaxing them might also help.
> >
> > Best,
> > Leonard
> >
> > On 01.09.21 13:36, Robin Booth wrote:
> >> Hi Volker
> >>
> >> I noticed that my Gadget4 run appeared to "stall" for several hours periodically during a run, with these instances roughly correlating to requested restart file outputs.
> >> On further investigation, from inspection of the balance.txt file for example (see extract attached), it would appear that this is caused by the code performing a Nsync-grv on the entire particle set, including an expensive fof tree walk. This pushes up the CPU step time from typically 4 minutes to around 8 hours! Does that seem reasonable to you? I assume that this process is necessary for the restart files to record the particle parameters for all particles at the same timestep.
> >> If this is indeed the correct restart file behaviour, then my conclusion would be that it is extremely counter-productive to request restart file output too frequently during a simulation run, particularly where I am running the simulation in time-limited "chunks". My understanding from the documentation is that a restart file will in any case be generated automatically when the run time approaches the limit set by the TimeLimitCPU parameter. Would you agree with this conclusion?
> >>
> >> Regards
> >>
> >> Robin
> >>
> > --
> > ===================================================
> > Leonard Romano, B.Sc.(レオナルド・ロマノ)
> > Physics Department
> > Technical University of Munich (TUM), Germany
> > Theoretical Astrophysics Group
> > Department of Earth and Space Science
> > Graduate School of Science, Osaka University, Japan
> > he / him / his
> > ===================================================
> >
> >
> > -----------------------------------------------------------
> >
> > If you wish to unsubscribe from this mailing, send mail to
> > minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
> > A web-archive of this mailing list is available here:
> > http://www.mpa-garching.mpg.de/gadget/gadget-list
>
> <subfind_log.txt>
Received on 2021-09-18 14:19:45
