Re: Gadget2 fatal error

From: Cameron McBride <cameron_at_phyast.pitt.edu>
Date: Sat, 20 Oct 2007 10:46:36 -0400

Cameron McBride wrote (19 Oct 2007 15:32 EDT):
> Sorry, the debugging cycle is a little slower since I have to wade
> through the queue for the large number of PE. I've got two things
> pending right now:
> 1. a run on 1536 PE.
> 2. a run that will output the input values of failing malloc()

Updates on both of these:

1. Still failed on 1536 PE.

   % grep 'llocated' n1250_b640_pm2048_pe1536_debug.pbs.o197554
   Allocated 100 MByte communication buffer per processor.
   Allocated 145.519 MByte for particle storage. 80
   allocated 0.0762939 Mbyte for ngb search.
   Allocated 143.372 MByte for BH-tree. 64

   which puts the reported per-node memory at just under 400 MB
   (about 389 MB).

   The arguments to the first successful malloc:
    toplist = malloc( 12392355 * 16 );
    (189 MB)

   The arguments of the failing malloc:
    toplist = malloc( 16818280 * 16);
    (256 MB)

   It's interesting that only some nodes failed to malloc
   in the domain decomp for Step 0.


2. More info on the 1024 PE attempts:

   % grep 'llocated' n1250_b640_pm2048_pe1024_debug.pbs.o197588
   Allocated 100 MByte communication buffer per processor.
   Allocated 218.279 MByte for particle storage. 80
   allocated 0.0762939 Mbyte for ngb search.
   Allocated 214.676 MByte for BH-tree. 64

   which puts reported per node memory at about 530 MB.

   The first malloc, which succeeded:
    toplist = malloc( 9685531 * 16 );
    (147.8 MB)

   The second malloc, which failed:
    toplist = malloc( 9731052 * 16);
    (148.5 MB)
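
As a cross-check on the per-node totals quoted above, here is a minimal
Python sketch that sums the "Allocated ... MByte" lines from the two grep
outputs. The log text is copied from this message; the regular expression
and the helper name reported_total_mb are only illustrative and are not
part of Gadget2.

```python
import re

# Log excerpts copied from the grep output quoted in this message.
LOG_1536 = """\
Allocated 100 MByte communication buffer per processor.
Allocated 145.519 MByte for particle storage. 80
allocated 0.0762939 Mbyte for ngb search.
Allocated 143.372 MByte for BH-tree. 64
"""

LOG_1024 = """\
Allocated 100 MByte communication buffer per processor.
Allocated 218.279 MByte for particle storage. 80
allocated 0.0762939 Mbyte for ngb search.
Allocated 214.676 MByte for BH-tree. 64
"""

# Matches both "Allocated ... MByte" and "allocated ... Mbyte".
ALLOC_RE = re.compile(r"allocated\s+([0-9.]+)\s+mbyte", re.IGNORECASE)


def reported_total_mb(log_text):
    """Sum every allocation size (in MB) reported in the log text."""
    return sum(float(m.group(1)) for m in ALLOC_RE.finditer(log_text))


for name, log in (("1536 PE", LOG_1536), ("1024 PE", LOG_1024)):
    print(f"{name}: {reported_total_mb(log):.1f} MB reported per processor")
```

This only counts what the code reports; anything allocated without a
matching log line is invisible to it.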

The differences between the successful and failing requests seem really
small, especially in the 1024 PE case (under 1 MB), to be explained by the
memory we've accounted for here - but I don't understand why else a simple
malloc() would fail except for lack of memory.

My next step is to try a full memory profile, since it appears that a
significant chunk of memory is allocated, but not reported, between the
initial domain decomposition and the one in Step 0.
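
To illustrate the kind of checkpointing such a profile would need, here is
a rough Python sketch that samples the peak resident set size at labelled
points. It is not Gadget2 code: the labels, the helper names, and the
64 MB bytearray standing in for an unreported allocation are all
hypothetical. In the C code the same idea would presumably use
getrusage(2) around the two domain decompositions.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process, in MB (2**20 bytes)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in kilobytes on Linux and in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return peak * scale / 2**20


def checkpoint(label, log):
    """Append (label, peak RSS in MB) to log and return the sample."""
    sample = (label, peak_rss_mb())
    log.append(sample)
    return sample


log = []
checkpoint("after initial domain decomposition", log)
hidden = bytearray(64 * 2**20)  # stand-in for memory nobody reports
checkpoint("before domain decomposition in Step 0", log)

for label, mb in log:
    print(f"{label}: peak RSS {mb:.1f} MB")
```

Comparing consecutive samples against the sizes the code prints itself
would show where the unreported memory appears.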

Also, it seems that increasing the number of PE (and with it the total
available memory) doesn't fix this, since the toplist memory requirement
goes up with more PE. (TOPNODEFACTOR=2.0 in the above cases.)
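
The trend can be tabulated from the numbers already in this message. The
sketch below converts the quoted malloc() arguments to MB (taking 1 MB as
2**20 bytes, which matches the figures above) and adds the failing request
to the reported per-processor total. The reported totals are the sums of
the "Allocated" lines; the function names are only illustrative.

```python
MB = 2**20        # the "MB" figures in this message match 2**20 bytes
ENTRY_BYTES = 16  # second factor in the quoted malloc() calls

# (PE count, reported MB per processor,
#  toplist entries that succeeded, toplist entries that failed)
RUNS = [
    (1024, 533.03, 9685531, 9731052),
    (1536, 388.97, 12392355, 16818280),
]


def toplist_mb(entries):
    """Size in MB of a toplist request for the given number of entries."""
    return entries * ENTRY_BYTES / MB


def summarize(run):
    """Return the toplist sizes and reported-plus-failing total for one run."""
    pe, reported_mb, ok_entries, failed_entries = run
    failed_mb = toplist_mb(failed_entries)
    return {
        "pe": pe,
        "ok_mb": toplist_mb(ok_entries),
        "failed_mb": failed_mb,
        "reported_plus_failed_mb": reported_mb + failed_mb,
    }


for run in RUNS:
    s = summarize(run)
    print(f"{s['pe']} PE: toplist ok {s['ok_mb']:.1f} MB, "
          f"failed {s['failed_mb']:.1f} MB, "
          f"reported + failed {s['reported_plus_failed_mb']:.1f} MB")
```

With these inputs the failing toplist request grows from about 148 MB to
about 257 MB going from 1024 to 1536 PE, while the reported per-processor
memory drops by a comparable amount.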

Could a corrupted input particle file cause any of this? (Seems
unlikely since the first domain decomp was successful)

Does someone with more understanding of Gadget2 have some suggestions or
see something I'm missing?

Thanks.

Cameron
Received on 2007-10-20 16:46:37
