Re: Failure to restart simulation from restart files

From: Volker Springel <volker_at_MPA-Garching.MPG.DE>
Date: Tue, 22 Mar 2011 13:14:47 +0100

On Mar 21, 2011, at 6:21 PM, Saju Varghese wrote:

>
> Dear Gadget users,
>
> I was running a GADGET simulation with 400^3 particles in a 60 Mpc
> box using the GADGET-3 code on Ranger (HPC at TACC), a total of 412
> snapshots from z = 99 to 0, using 512 processors.
>
> The run went smoothly until snapshot 320 (z = 0.91), about halfway
> through, when the code gave me the following error:
> I/O error (fwrite) on task=76 has occured: Input/output error
> task 76: endrun called with an error level of 777
>
> When I tried to restart from the restart files previously written, the
> following error was produced:
> I/O error (fread) on task=76 has occured: end of file
> task 76: endrun called with an error level of 778
> Apparently one of the restart files was corrupted (filesize = 0B).
>
> As my next attempt to continue the simulation, I tried restarting the
> run from the backup restart files (restart.bak), however now I'm running
> into the following error of inconsistent time stamps:
> restart_file-time: Task0:0.522945 All:0.52129
> restart_file-time-in_if: Task0:0.522945 All:0.52129
> The restart file on task=205 is not consistent with the one on task=0
> task 205: endrun called with an error level of 16
>
> restart_file-time: Task0:0.522945 All:0.52129
> restart_file-time-in_if: Task0:0.522945 All:0.52129
> The restart file on task=333 is not consistent with the one on task=0
> task 333: endrun called with an error level of 16
>
> restart_file-time: Task0:0.522945 All:0.52129
> restart_file-time-in_if: Task0:0.522945 All:0.52129
> The restart file on task=461 is not consistent with the one on task=0
> task 461: endrun called with an error level of 16
>
> restart_file-time: Task0:0.522945 All:0.52129
> restart_file-time-in_if: Task0:0.522945 All:0.52129
> The restart file on task=77 is not consistent with the one on task=0
> task 77: endrun called with an error level of 16
>
> To get the output above, I modified restart.c to print out the values
> of all_task0.Time and All.Time:
>   printf("restart_file-time: Task0:%g All:%g\n", all_task0.Time, All.Time);
>   if(all_task0.Time != All.Time)
>     {
>       printf("restart_file-time-in_if: Task0:%g All:%g\n", all_task0.Time, All.Time);
>       printf("The restart file on task=%d is not consistent with the one on task=0\n", ThisTask);
>       fflush(stdout);
>       endrun(16);
>     }
>
> Has anyone else experienced this error before? Is there any way I could
> restart the run without starting from the previously generated snapshot?

Hi Saju,

If the code is interrupted while writing restart files (for example because of a disk-full error), your restart.* files will be a mix of "old" and "new" files belonging to two different timesteps. This happens because the code usually does not write all the files concurrently, but does it in batches. The "new" files belong to the most recent timestep; for these, the code has already renamed the previous restart file to a ".bak" file. For tasks that had not yet started writing, the restart file from the old set is still in place and belongs to the timestep of the previous set of restart files.
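
The mixed state described above can be illustrated with a toy model in Python. This is not GADGET code: it only assumes, as stated above, that each task renames its existing restart file to ".bak" before writing a new one, and that tasks write in batches. The function name, the batch size, and the step labels "A", "B", "C" are made up for illustration, and the interruption is modelled as happening between two batches rather than in the middle of a write.

```python
def write_restart_set(files, step, n_tasks, batch_size, fail_before_batch=None):
    """Toy model of one restart dump; `files` maps a file name to the step stored in it."""
    for batch, first in enumerate(range(0, n_tasks, batch_size)):
        if batch == fail_before_batch:
            return False  # interrupted here, e.g. by a disk-full error
        for task in range(first, min(first + batch_size, n_tasks)):
            name = f"restart.{task}"
            if name in files:
                files[name + ".bak"] = files.pop(name)  # old file becomes the backup
            files[name] = step  # new file for the current step
    return True


files = {}
write_restart_set(files, "A", n_tasks=4, batch_size=2)
write_restart_set(files, "B", n_tasks=4, batch_size=2)
write_restart_set(files, "C", n_tasks=4, batch_size=2, fail_before_batch=1)

for name in sorted(files):
    print(name, files[name])
```

After the interrupted third dump, restart.0 and restart.1 hold step C with step B in their ".bak" files, while restart.2 and restart.3 still hold step B. Moving the two ".bak" files back gives a consistent set for step B.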

So if you look at the time-stamps of the "restart.*" files with "ls -ltr", you should be able to identify two groups of files, an older one and a newer one. You now need to carefully rename the restart*.bak files corresponding to the newer group of restart.* files back to restart.* (you can do this with a script if needed). You will then have a consistent set of "old" restart.* files again, from which you can restart normally.
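
A minimal sketch of such a rename script in Python follows. It rests on several assumptions that are not confirmed above: that the files are named restart.<task> and restart.<task>.bak, that the two groups can be separated at the largest jump in modification time, and that keeping the newer file under a hypothetical ".interrupted" suffix is acceptable. The function names are invented for this example. Check the grouping against "ls -ltr" before trusting the heuristic; if the files all belong to one set, the split is meaningless.

```python
import re
from pathlib import Path

RESTART_RE = re.compile(r"^restart\.(\d+)$")  # assumed naming: restart.<task>


def restart_files(directory):
    """Return (mtime, path) pairs for all restart.<task> files, oldest first."""
    found = [(p.stat().st_mtime, p) for p in Path(directory).iterdir()
             if RESTART_RE.match(p.name)]
    return sorted(found, key=lambda item: item[0])


def split_by_largest_gap(files):
    """Heuristic: split the sorted list at the largest jump in modification time."""
    if len(files) < 2:
        return files, []
    gaps = [files[i + 1][0] - files[i][0] for i in range(len(files) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return files[:cut], files[cut:]


def restore_backups(directory, dry_run=True):
    """For each file in the newer group, move restart.N.bak back to restart.N."""
    _, newer = split_by_largest_gap(restart_files(directory))
    pairs = [(path.with_name(path.name + ".bak"), path) for _, path in newer]
    missing = [bak.name for bak, _ in pairs if not bak.exists()]
    if missing:
        raise FileNotFoundError(f"missing backups: {missing}")
    for bak, path in pairs:
        print(f"{bak.name} -> {path.name}")
        if not dry_run:
            # Keep the newer (possibly incomplete) file instead of overwriting it.
            path.rename(path.with_name(path.name + ".interrupted"))
            bak.rename(path)
    return sorted(path.name for _, path in pairs)
```

Calling restore_backups("restartfiles") only prints the planned renames (the directory name is a placeholder); passing dry_run=False performs them. The dry run is the default so that the plan can be compared with the "ls -ltr" listing first.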

This problem is not caused by any MPI trouble.

Volker

> Does this mean that the previous fwrite of restart.bak was also
> corrupted due to bad MPI processes? I have heard rumors that MPI on
> Ranger is not very good.
>
> Thank you in advance,
> Saju Varghese
Received on 2011-03-22 13:14:50