Gadget stalling without apparent reason

From: Enrique Vazquez <e.vazquez_at_crya.unam.mx>
Date: Tue, 6 Sep 2011 18:34:54 -0500 (CDT)

Hi,

I'm submitting some test runs on my new 180-core cluster, but I'm
encountering a very strange behavior. A run that I had performed on
my previous cluster (118^3 SPH particles, no DM particles, on 8 CPUs)
runs fine on 8 cores of the new cluster, with identical parameters and
only the compile options changed to fit the new system. But when I run
it on 32 cores, it advances only up to a certain time and then ceases to
advance any further. It generally stops somewhere near the tree calculation, either
while doing the domain decomposition, or while computing the tree force,
or while computing the potential for all particles. Other runs I'm doing
with larger SPH particle numbers (up to 27 million) on up to 128 cores
act the same way.

The strange part is that there's no crash. The system behaves as if it
had entered an infinite loop, with the processors still crunching at full
speed, but no further advance in simulation time occurs. No error messages
are displayed, and the last timestep printed is perfectly normal (0.146).
Because the problem occurs for 32 cores but not for 8, I guess it's some
problem with MPI.

I'm running Gadget2, with a few modifications of our own to include extra ISM
physics, on AMD 12-core processors, using Open MPI and the Intel compilers. The
interconnect is InfiniBand. Any suggestions or insight will be greatly appreciated!

Best regards,
Enrique
Received on 2011-09-07 01:35:05