Re: Gadget2 fatal error

From: Cameron McBride <cameron_at_phyast.pitt.edu>
Date: Fri, 19 Oct 2007 15:32:54 -0400

Hello,

Volker Springel wrote (17 Oct 2007 07:28 EDT):
> This looks like a memory allocation problem.
>
> In domain.c, there is only one call of MPI_Allgatherv(), line 953.
> A couple of lines above, the buffer toplist is allocated with
>
> toplist = malloc(ntop * sizeof(struct topnode_exchange));

Indeed, that seemed to be the culprit of the error I'm seeing.

> You could check for this issue by changing the allocation statement to
>
> if(!(toplist = malloc(ntop * sizeof(struct topnode_exchange))))
> endrun(11231);

I added that check. Following a suggestion from Luca, I also print
a little more info before the endrun:

  MPI_Barrier(MPI_COMM_WORLD);
  printf("DEBUG: %d :: %p %p\n", ThisTask, (void *) toplist_local, (void *) toplist);

This showed that the toplist_local allocation was fine, while the toplist
allocation fails on every task.

I've now run this a few times, and it consistently fails in one of two
places. Sometimes it fails in the 1st domain decomposition; other times
it makes it through the 1st one and fails in the domain decomposition in
Step 0 (which is the second one).

But why would it be failing? It succeeds the first time through on several
occasions, so what significant differences are there in this function for
the 2nd run (in Step 0)? Also, judging from the output file, the memory
requirements aren't near the machine maximum (~1 GB per PE):

  % grep 'llocated' n1250_b640_pm2048_pe1024_debug.pbs.o196587
  Allocated 100 MByte communication buffer per processor.
  Allocated 218.279 MByte for particle storage. 80
  allocated 0.0762939 Mbyte for ngb search.
  Allocated 214.676 MByte for BH-tree. 64

That adds up to about 533 MB. I've also tried it with the PM off, and it
still fails (not too surprising: according to the output, Gadget2 hasn't
allocated space for the PMGRID yet). Because struct particle_data is
slightly smaller with PMGRID off, this saves us a bit on the per-PE
memory:

  Allocated 185.537 MByte for particle storage. 68
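
As a quick check of the per-PE totals quoted above, here is a minimal
sketch in C that sums the figure following each "llocated" in a log
excerpt. The function name sum_allocated_mbyte and the two string
constants are hypothetical helpers for illustration, not part of Gadget2.
The PM-off excerpt assumes the other three lines are unchanged from the
PM-on run, which the output above does not show. Under that assumption
the sums come to roughly 533.03 MB and 500.29 MB.

```c
#include <stdlib.h>
#include <string.h>

/* Log lines copied from the run with PMGRID on. */
static const char *LOG_PM_ON =
  "Allocated 100 MByte communication buffer per processor.\n"
  "Allocated 218.279 MByte for particle storage. 80\n"
  "allocated 0.0762939 Mbyte for ngb search.\n"
  "Allocated 214.676 MByte for BH-tree. 64\n";

/* ASSUMPTION: with PM off only the particle storage line changes. */
static const char *LOG_PM_OFF =
  "Allocated 100 MByte communication buffer per processor.\n"
  "Allocated 185.537 MByte for particle storage. 68\n"
  "allocated 0.0762939 Mbyte for ngb search.\n"
  "Allocated 214.676 MByte for BH-tree. 64\n";

/* Hypothetical helper: add up the number that follows each
 * "llocated " (matches both "Allocated" and "allocated"). */
double sum_allocated_mbyte(const char *log)
{
  const char *key = "llocated ";
  const char *p = log;
  double total = 0.0;

  while((p = strstr(p, key)) != NULL)
    {
      char *end;
      double value;

      p += strlen(key);
      value = strtod(p, &end);
      if(end != p)              /* a number was actually parsed */
        total += value;
      p = end;
    }
  return total;
}
```

Calling sum_allocated_mbyte(LOG_PM_ON) and sum_allocated_mbyte(LOG_PM_OFF)
gives the two totals. Trailing values such as the "80" and "64" at the
end of some lines are ignored because they do not follow the keyword.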

Based on the 960^3 runs, I'd expect the FFT to require about 170 MB per
PE for a PMGRID=2048 on 1024 PE - which might be pushing it, but it
should still work.

For different runs on a smaller number of PE, I've been able to get
Gadget2 to work with higher reported memory statistics on this same
platform (Cray XT3): specifically, 610 MB per PE with the FFT data.

Sorry, the debugging cycle is a little slow since I have to wade
through the queue for the large number of PE. I've got two things
pending right now:
  1. a run on 1536 PE.
  2. a run that will output the input values of the failing malloc().
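
For the second item, a minimal sketch in C of what reporting the malloc()
inputs could look like. The names checked_nbytes and malloc_report are
hypothetical and not part of Gadget2. One thing such a report can rule in
or out: if the element count is a signed integer that has gone negative
(an assumption here, not a diagnosis), then multiplying it by sizeof(...)
converts it to a very large unsigned value, and malloc() fails no matter
how much memory is free.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: byte count for `count` elements of `elem_size`
 * bytes. Returns 0 if count is negative, if elem_size is 0, or if the
 * product would not fit in size_t. */
size_t checked_nbytes(long long count, size_t elem_size)
{
  if(count < 0 || elem_size == 0)
    return 0;
  if((unsigned long long) count > SIZE_MAX / elem_size)
    return 0;
  return (size_t) count * elem_size;
}

/* Hypothetical wrapper: like malloc(count * elem_size), but prints the
 * inputs when the request is rejected or the allocation fails.
 * Note that a count of 0 is also reported and returns NULL. */
void *malloc_report(long long count, size_t elem_size, const char *label, int task)
{
  size_t nbytes = checked_nbytes(count, elem_size);
  void *p = nbytes ? malloc(nbytes) : NULL;

  if(p == NULL)
    {
      fprintf(stderr, "task %d: allocation of '%s' failed: count=%lld elem_size=%zu nbytes=%zu\n",
              task, label, count, elem_size, nbytes);
      fflush(stderr);
    }
  return p;
}
```

In domain.c this could stand in for the plain malloc() call, for example
toplist = malloc_report(ntop, sizeof(struct topnode_exchange), "toplist", ThisTask);
followed by the existing endrun() if the result is NULL.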

Any further suggestions on where to look?

Cameron
Received on 2007-10-19 21:32:55
