Re: Corrupt Particle ID's in Snapshot Files

From: Volker Springel <volker_at_MPA-Garching.MPG.DE>
Date: Wed, 08 Nov 2006 14:56:54 +0100

Hi Matthew,

This is a strange problem. I only encountered something similar once, in
the form of a section of ID-values in a snapshot file that contained
garbage - very similar to your symptoms. In this case it was a corrupted
filesystem on a RAID server.

It could be that you have some kind of hardware problem here too.
One thing that you could do is to put in a test into the code that
writes the snapshot files, which would explicitly check whether the IDs
that the code wants to write to disk contain sensible values.

To do this, you could replace the line

    my_fwrite(CommBuffer, bytes_per_blockelement, pc, fd);

in io.c with something like:

  {
    if(blocknr == IO_ID)
      {
        /* check that every ID about to be written is in the expected range */
        int i, *idp;

        idp = (int *) CommBuffer;
        for(i = 0; i < pc; i++, idp++)
          if(*idp < 0 || *idp > 256 * 256 * 256)
            {
              printf("Wrong ID found! i=%d ID=%d\n", i, *idp);
              endrun(1);
            }
      }
    my_fwrite(CommBuffer, bytes_per_blockelement, pc, fd);
  }

This would terminate the simulation if it tries to write IDs that are
not sensible. If this should occur, there is either an as yet
undiscovered bug in io.c, or something is going wrong with the MPI
communication. If the above error condition is never encountered and
your snapshots are nevertheless found to be corrupted, your filesystem
may have a problem.
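
If you want to try the range check outside of the code first, it can be
pulled out into a small helper like the one below. This is only a
sketch: the name find_bad_id and its arguments are made up for
illustration and are not part of io.c, and it assumes the IDs are
stored as plain ints with known lower and upper bounds.

```c
#include <stddef.h>

/* Hypothetical helper, not part of GADGET2.
 * Scans `count` IDs and returns the index of the first one that lies
 * outside [min_id, max_id], or -1 if every ID is in range. */
long find_bad_id(const int *ids, long count, int min_id, int max_id)
{
  long i;

  if(ids == NULL)
    return -1;                  /* nothing to check */

  for(i = 0; i < count; i++)
    if(ids[i] < min_id || ids[i] > max_id)
      return i;                 /* first offending position */

  return -1;
}
```

Called on the four values 1, 16777216, -2071986048, 42 with bounds 1
and 256*256*256, it returns 2, the position of the garbage value.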

Volker


Matthew Francis wrote:
> Hi All,
>
> I'm getting an odd error in some of my GADGET2 snapshots. Occasionally
> I'll get a bunch of particle ID's that look corrupted. For example in a
> file containing 16777216 particles (256^3), I checked all the ID's to
> see if they lay between 1-N_particles, and found the following:
>
> Particle ID: Position in Snapshot:
> -2071986048 6014114
> 1521338276 6014115
> 33554432 6014117
> -1551499008 6014118
> 50331648 6014119
> 117440512 6014120
> 536870912 6014122
> -1003326207 6014123
> 100663296 6014124
> 1701080942 6014125
> -301793280 6014127
> -368902144 6014128
> -368902144 6014130
> 603979776 6014133
> 100663297 6014134
> 33556480 6014135
>
> All the other particles in the file had sensible ID's, so it's only ~20
> in ~16 million that are affected. As you can see the particles affected
> are next to each other in the list of particles in the snapshot file. The
> positions of the particles with the crazy IDs look sensible, i.e. they
> lie within the box, just the IDs are affected. The other strange thing
> about this is that in a given run it won't affect all the snapshots, so
> the corrupt ID's are not being carried through the simulation. Roughly 1
> in 5 snapshots I write out show this strange behaviour.
>
> I'm using the standard GADGET2 snapshot format, but not HDF5.
>
> I've recently started using GADGET2 on a new dedicated cluster, rather
> than the network I had been using. On the old system I had no problems,
> this has only arisen since moving to the new cluster. I'll try and give
> as much information about this system as I can. It's a 16 node cluster
> with each node containing 2 dual core AMD Opteron processors with 4GB
> per node. I'm running my simulations with 256^3 particles on 2 nodes
> each, so 8 processors and 8GB of RAM in total. The nodes are running
> Open SuSE Linux 10.1, using mpich2 to implement mpi.
>
> Has anyone seen an error like this before or can suggest a possible
> solution? It's not a terminal issue for me, since for the moment I'm not
> worried about tracking individual particles, but I may be in the future
> and in any case am concerned that this may be pointing to a deeper
> issue.
>
> Regards
>
> Matt Francis
>
Received on 2006-11-08 14:56:54