Re: Restart file strategy

From: Leonard Romano <leonard.romano_at_tum.de>
Date: Fri, 3 Sep 2021 10:46:34 +0200

Dear Robin,


You can find the excessive time usage in lines 105-107:

    SUBFIND: subfind_hbt_single_group() processing for Ngroups=28848 took 39163.5 sec
    SUBFIND: root-task=0: Serial processing of halo 0 took 39172.3
    SUBFIND: Processing overall took  (total time=39535.3 sec)

However, it is hard for me to tell what exactly is going wrong at that
point. I recommend checking the corresponding function in the code;
hopefully you can find something there.
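
To locate such lines quickly in a long stdout log, something like the
following could help. It is only a minimal sketch: the function name
slowest_steps and the regular expression are my own illustration (not part
of Gadget-4), and it assumes each timing is reported on a single log line,
either as "took <seconds>" or as "total time=<seconds>", as in the excerpt
above.

```python
import re

# Captures the number of seconds in "took 39163.5" or "total time=39535.3".
TIME_RE = re.compile(r"(?:took\s+|total time=)(\d+(?:\.\d+)?)")


def slowest_steps(lines, prefix="SUBFIND:", top=5):
    """Return up to `top` (seconds, line) pairs, largest time first.

    Only lines containing `prefix` are considered; lines without a
    recognisable timing are skipped.
    """
    hits = []
    for line in lines:
        if prefix not in line:
            continue
        match = TIME_RE.search(line)
        if match:
            hits.append((float(match.group(1)), line.strip()))
    return sorted(hits, reverse=True)[:top]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python slowest_steps.py stdout.log
    with open(sys.argv[1]) as log:
        for seconds, text in slowest_steps(log):
            print(f"{seconds / 3600.0:7.2f} h  {text}")
```

Printing the times in hours makes it easier to compare them with the
wall-clock stalls seen in balance.txt.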


Since SUBFIND allocates some additional fields, I think that recompiling
without it and trying to restart from your existing restart files would
mess with the binary format and would probably not work.
You could instead try to force the writing of a snapshot by modifying your
output list, and then restart from that snapshot.
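
As a rough illustration of the output-list edit, the sketch below inserts
one additional output time into the list. It rests on assumptions you should
check against your own file: that the file named by OutputListFilename holds
one output time per line in its first column (a scale factor here, since
ComovingIntegrationOn is 1), with no comment lines. The helper name
add_output_time is hypothetical, and any further columns on existing lines
are kept but not generated for the new entry.

```python
def add_output_time(lines, new_time):
    """Return the output-list lines with `new_time` added, sorted by time.

    `lines` are the existing lines of the output list; the first
    whitespace-separated field of each line is taken as the output time.
    Existing lines are kept verbatim (minus the trailing newline).
    """
    entries = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        entries.append((float(fields[0]), line.rstrip("\n")))

    # Only add the new time if it is not already present.
    if not any(abs(t - new_time) < 1e-12 for t, _ in entries):
        entries.append((new_time, f"{new_time:g}"))

    return [text for _, text in sorted(entries, key=lambda e: e[0])]


if __name__ == "__main__":
    # Path taken from the parameter listing quoted below; 0.75 is only an
    # example value, pick a time shortly after the current simulation time.
    path = "parameterfiles/output_list.txt"
    with open(path) as f:
        updated = add_output_time(f.readlines(), 0.75)
    with open(path, "w") as f:
        f.write("\n".join(updated) + "\n")
```

Keep a copy of the original list before overwriting it.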


Best,
Leonard


On 03.09.21 00:28, Robin Booth wrote:
> Hi Volker, Leonard
>
> Thanks for your respective comments.
>
> Clearly my concern about restart files was not relevant to the current
> issue.
>
> I attach an extract from the log file for one of the timesteps where
> SUBFIND is invoked.
> In case it is relevant, the contents of parameters_used values are
> included below.
> I can see nothing in the log to indicate where the excessive time is
> being incurred, but it may be more apparent to one of you.
>
> If I were to recompile and run without the SUBFIND option, would the
> resulting executable be compatible with my existing restart files? If
> so, I might try this on my next run.
>
> Regards
>
> Robin
>
> -------------------------------------------------------------------------
>
>
> InitCondFile                                      /cosma7/data/dp004/dc-boot5/Gadget4_snapshots/IC_files/Planck2013-Npart_2048_Box_3000-Fiducial
> OutputDir                                         /cosma6/data/dp004/dc-boot5/Gadget4_snapshots/Newbuild/
> SnapshotFileBase                                  snapshot
> ICFormat                                          1
> SnapFormat                                        3
> NumFilesPerSnapshot                               8
> MaxFilesWithConcurrentIO                          256
> TimeLimitCPU                                      160000
> CpuTimeBetRestartFile                             160000
> MaxMemSize                                        16000
> TimeBegin                                         0.02
> TimeMax                                           1
> ComovingIntegrationOn                             1
> Omega0                                            0.307115
> OmegaLambda                                       0.692885
> OmegaBaryon                                       0.0482519
> HubbleParam                                       0.6777
> Hubble                                            100
> BoxSize                                           3000
> OutputListOn                                      1
> OutputListFilename                                parameterfiles/output_list.txt
> TimeBetSnapshot                                   0
> TimeOfFirstSnapshot                               0
> TimeBetStatistics                                 0.1
> ErrTolIntAccuracy                                 0.015
> CourantFac                                        0.3
> MaxSizeTimestep                                   0.02
> MinSizeTimestep                                   0
> TypeOfOpeningCriterion                            1
> ErrTolTheta                                       0.5
> ErrTolThetaMax                                    1
> ErrTolForceAcc                                    0.005
> TopNodeFactor                                     3
> ActivePartFracForNewDomainDecomp                  0.01
> ActivePartFracForPMinsteadOfEwald                 0.05
> UnitLength_in_cm                                  3.08568e+24
> UnitMass_in_g                                     1.989e+43
> UnitVelocity_in_cm_per_s                          100000
> GravityConstantInternal                           0
> SofteningComovingClass0                           0.05
> SofteningMaxPhysClass0                            0.05
> SofteningClassOfPartType0                         0
> SofteningClassOfPartType1                         0
> SofteningClassOfPartType2                         0
> SofteningClassOfPartType3                         0
> SofteningClassOfPartType4                         0
> SofteningClassOfPartType5                         0
> DesNumNgb                                         64
> DesLinkNgb                                        15
> MaxNumNgbDeviation                                1
> ArtBulkViscConst                                  1
> MinEgySpec                                        0
> InitGasTemp                                       0
> LightConeDefinitionFile                           parameterfiles/lightcones.txt
> ------------------------------------------------------------------------
> *From:* Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
> *Sent:* 02 September 2021 09:46
> *To:* Gadget General Discussion <gadget-list_at_MPA-Garching.MPG.DE>
> *Subject:* Re: [gadget-list] Restart file strategy
>
> Hi Robin,
>
> I agree with Leonard that especially the time reported for step 2415
> is not reasonable at all. It appears to be caused by the FOF group
> finder, which normally should be very fast... Something is not right
> here. This is not the only odd thing in your excerpt from
> balance.txt, however. I note in particular that timesteps occupied
> with only ~3000-4000 particles take very long - longer in fact than
> the bigger timesteps! This is pretty strange also. Not sure what the
> setup of your run is, but this is highly unusual, and especially step
> 2415 suggests that something isn't quite right.
>
> I'm not fully sure why you suspected the writing of the restart files
> is to blame. This doesn't cause extra computations, i.e. there is no
> global gravity calculation or tree walk induced by this. The only cost
> it has is I/O time, i.e. it depends primarily on the speed of the
> filesystem whether a too high frequency of "safety" restart-file
> writing is problematic or not. Normally it isn't if this is done only
> every > 6 hours or so. You can check for the time this takes by
> grepping for "RESTART:" in the stdout log-file. Yes, one can in
> principle completely dispense with regular intermediate restart files
> (whose only purpose is to resume the simulation from there in case one
> has a system or code crash - disabling this means one has to start
> from the beginning if this happens). When the code ends due to
> reaching TimeLimitCPU, a restart set is written in any case.
>
> Best,
> Volker
>
>
> > On 1. Sep 2021, at 13:48, Leonard Romano <leonard.romano_at_tum.de> wrote:
> >
> > Hi Robin,
> >
> > From looking at the balance.txt you have provided, I can immediately
> > see that during the long step, there are a lot of '6' characters which
> > correspond to either FOF or SUBFIND (look at logs.h for details). Thus
> > I recommend trying to turn them off and seeing if this solves your
> > problem.
> > The problem you are facing might be related to the force accuracy
> > requirements you have set, which cause subfind to run for an
> > unnecessarily long time, so lowering them might also help.
> >
> > Best,
> > Leonard
> >
> > On 01.09.21 13:36, Robin Booth wrote:
> >> Hi Volker
> >>
> >> I noticed that my Gadget4 run appeared to "stall" for several hours
> >> periodically during a run, with these instances roughly correlating to
> >> requested restart file outputs.
> >> On further investigation, from inspection of the balance.txt file
> >> for example (see extract attached), it would appear that this is
> >> caused by the code performing a Nsync-grv on the entire particle set,
> >> including an expensive fof tree walk. This pushes up the CPU step time
> >> from typically 4 minutes to around 8 hours! Does that seem reasonable
> >> to you? I assume that this process is necessary for the restart files
> >> to record the particle parameters for all particles at the same timestep.
> >> If this is indeed the correct restart file behaviour, then my
> >> conclusion would be that it is extremely counter-productive to request
> >> restart file output too frequently during a simulation run,
> >> particularly where I am running the simulation in time-limited
> >> "chunks". My understanding from the documentation is that a restart
> >> file will in any case be generated automatically when the run time
> >> approaches the limit set by the TimeLimitCPU parameter. Would you
> >> agree with this conclusion?
> >>
> >> Regards
> >>
> >> Robin
> >>
> > --
> > ===================================================
> > Leonard Romano, B.Sc.(レオナルド・ロマノ)
> > Physics Department
> > Technical University of Munich (TUM), Germany
> > Theoretical Astrophysics Group
> > Department of Earth and Space Science
> > Graduate School of Science, Osaka University, Japan
> > he / him / his
> > ===================================================
> >
> >
> > -----------------------------------------------------------
> >
> > If you wish to unsubscribe from this mailing, send mail to
> > minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
> > A web-archive of this mailing list is available here:
> > http://www.mpa-garching.mpg.de/gadget/gadget-list

-- 
===================================================
Leonard Romano, B.Sc.(レオナルド・ロマノ)
Physics Department
Technical University of Munich (TUM), Germany
Theoretical Astrophysics Group
Department of Earth and Space Science
Graduate School of Science, Osaka University, Japan
he / him / his
===================================================
Received on 2021-09-03 10:46:40