Failure to restart simulation from restart files

From: Saju Varghese <saju_at_physics.unlv.edu>
Date: Mon, 21 Mar 2011 10:21:33 -0700

Dear Gadget users,

I was running a GADGET simulation with 400^3 particles in a 60 Mpc
box using the GADGET-3 code on Ranger (HPC at TACC), writing a total of
412 snapshots from z = 99 to 0, on 512 processors.

The run went smoothly until snapshot 320 (z = 0.91), about halfway
through the run, when the code gave me the following error:
I/O error (fwrite) on task=76 has occured: Input/output error
task 76: endrun called with an error level of 777

When I tried to restart from the restart files previously written, the
following error was produced:
I/O error (fread) on task=76 has occured: end of file
task 76: endrun called with an error level of 778
Apparently one of the restart files was corrupted (file size = 0 B).
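
A zero-byte file like this can be spotted before resubmitting by checking the size of every task's restart file. The following is only a minimal sketch under stated assumptions: it assumes the files sit in one directory and are named restart.<task>, with the backups carrying a .bak suffix, and the helper names restart_file_size and count_bad_restart_files are hypothetical, not part of GADGET. It only detects missing or empty files; it cannot tell whether a non-empty file is internally consistent.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Return the size of `path` in bytes, or -1 if the file cannot be stat'ed
 * (for example because it does not exist). */
long long restart_file_size(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (long long) st.st_size;
}

/* Check the files <dir>/<base>.<task><suffix> for task = 0 .. ntasks-1.
 * The naming scheme is an assumption; adjust it to the actual run.
 * Prints every missing or empty file and returns how many were found. */
int count_bad_restart_files(const char *dir, const char *base,
                            const char *suffix, int ntasks)
{
    char path[1024];
    int bad = 0;

    assert(dir != NULL && base != NULL && suffix != NULL);

    for (int task = 0; task < ntasks; task++) {
        snprintf(path, sizeof(path), "%s/%s.%d%s", dir, base, task, suffix);

        long long size = restart_file_size(path);
        if (size <= 0) {
            printf("task %d: %s is %s\n", task, path,
                   size < 0 ? "missing" : "empty");
            bad++;
        }
    }
    return bad;
}
```

For a 512-task run one would call, for example, count_bad_restart_files("restartfiles", "restart", "", 512) for the regular files and pass ".bak" as the suffix for the backups; the directory name here is just a placeholder.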

As my next attempt to continue the simulation, I tried restarting the
run from the backup restart files (restart.bak); however, I am now
running into the following error about inconsistent time stamps:
restart_file-time: Task0:0.522945 All:0.52129
restart_file-time-in_if: Task0:0.522945 All:0.52129
The restart file on task=205 is not consistent with the one on task=0
task 205: endrun called with an error level of 16

restart_file-time: Task0:0.522945 All:0.52129
restart_file-time-in_if: Task0:0.522945 All:0.52129
The restart file on task=333 is not consistent with the one on task=0
task 333: endrun called with an error level of 16

restart_file-time: Task0:0.522945 All:0.52129
restart_file-time-in_if: Task0:0.522945 All:0.52129
The restart file on task=461 is not consistent with the one on task=0
task 461: endrun called with an error level of 16

restart_file-time: Task0:0.522945 All:0.52129
restart_file-time-in_if: Task0:0.522945 All:0.52129
The restart file on task=77 is not consistent with the one on task=0
task 77: endrun called with an error level of 16

To produce the output above, I modified the consistency check in
restart.c to print out the values of all_task0.Time and All.Time:
printf("restart_file-time: Task0:%g All:%g\n", all_task0.Time, All.Time);
if(all_task0.Time != All.Time)
  {
    printf("restart_file-time-in_if: Task0:%g All:%g\n", all_task0.Time, All.Time);
    printf("The restart file on task=%d is not consistent with the one on task=0\n", ThisTask);
    fflush(stdout);
    endrun(16);
  }

Has anyone else experienced this error before? Is there any way I could
restart the run without starting from the previously generated snapshot?
Does this mean that the previous fwrite of restart.bak was also
corrupted due to bad MPI processes? I have heard rumors that MPI on
Ranger is not very good.

Thank you in advance,
Saju Varghese
Received on 2011-03-21 18:21:41
