Re: Fwd: Troubles getting Gadget2 running on a cluster

From: Volker Springel <volker_at_MPA-Garching.MPG.DE>
Date: Mon, 18 Feb 2008 10:40:54 +0100

Hi Gregory,

This looks to me like some kind of problem with the MPI library.
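
For reference, on Linux the two errno values in the quoted log decode as
110 = ETIMEDOUT and 32 = EPIPE: a receive on a socket timed out, and a
write hit a connection that was already closed. The small sketch below
prints the symbolic name and system message for such codes. The helper
names errno_symbol() and report_errno() are made up for illustration and
are not part of Gadget2 or of any MPI library.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: map the two codes seen in the log to their names. */
const char *errno_symbol(int code)
{
  if (code == ETIMEDOUT)
    return "ETIMEDOUT";
  if (code == EPIPE)
    return "EPIPE";
  return "other";
}

/* Hypothetical helper: print the code, its name and the system message. */
void report_errno(int code)
{
  printf("errno %d: %s (%s)\n", code, errno_symbol(code), strerror(code));
}
```

Calling report_errno(110) and report_errno(32) on the cluster nodes
should confirm the mapping for that platform.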

One thing you could try is to locate all calls to "system()" in the
source code of Gadget2 and simply comment them out. (They are not
needed for the calculation.) On some MPI-2 installations (like
mvapich2), calls to system() are incompatible with the MPI library
because of problems with certain pinned memory pages.
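
Instead of commenting out each call by hand, one option is to route them
through a small wrapper that can be switched off at compile time. This is
only a sketch: the wrapper name run_shell_command() and the macro
NO_SYSTEM_CALLS are assumptions made up for this example and do not exist
in the Gadget2 source.

```c
#include <stdlib.h>

/* Hypothetical wrapper, not part of Gadget2.
 * Compile with -DNO_SYSTEM_CALLS to skip every shell command. */
int run_shell_command(const char *cmd)
{
#ifdef NO_SYSTEM_CALLS
  (void) cmd;                   /* command is ignored in this mode */
  return 0;                     /* pretend the command succeeded */
#else
  return system(cmd);           /* normal behaviour */
#endif
}
```

Each system(buf) in the code would then become run_shell_command(buf),
and the flag can be toggled per machine without touching the source again.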

In general, it would probably be a good idea to try an MPI test program
and, if it reports problems, another MPI installation (MPICH2, OpenMPI,
etc.). It would also be important to see whether it makes a difference
if you run a job with 4 MPI tasks scattered over different nodes or with
all of them placed on a single node.
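
A minimal test of this kind could look like the sketch below. It is not
taken from Gadget2 or from any MPI distribution. Each task prints the
host it runs on and then repeatedly exchanges a buffer with its
neighbours in a ring, so it exercises point-to-point traffic both within
a node and across nodes. The message size NBYTES and the repetition
count NROUNDS are arbitrary values chosen for illustration.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NROUNDS 1000            /* arbitrary number of exchanges */
#define NBYTES  (1 << 20)       /* arbitrary message size: 1 MiB */

int main(int argc, char **argv)
{
  int rank, size, namelen, left, right, i;
  char host[MPI_MAX_PROCESSOR_NAME];
  char *sendbuf, *recvbuf;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Get_processor_name(host, &namelen);
  printf("task %d of %d runs on %s\n", rank, size, host);

  sendbuf = malloc(NBYTES);
  recvbuf = malloc(NBYTES);
  if(sendbuf == NULL || recvbuf == NULL)
    MPI_Abort(MPI_COMM_WORLD, 1);
  memset(sendbuf, rank & 0x7f, NBYTES);     /* payload identifies the sender */

  right = (rank + 1) % size;                /* task we send to */
  left = (rank + size - 1) % size;          /* task we receive from */

  for(i = 0; i < NROUNDS; i++)
    {
      MPI_Sendrecv(sendbuf, NBYTES, MPI_BYTE, right, 0,
                   recvbuf, NBYTES, MPI_BYTE, left, 0,
                   MPI_COMM_WORLD, &status);

      /* verify that the data really came from the left neighbour */
      if(recvbuf[0] != (char) (left & 0x7f) ||
         recvbuf[NBYTES - 1] != (char) (left & 0x7f))
        {
          fprintf(stderr, "task %d: corrupt data in round %d\n", rank, i);
          MPI_Abort(MPI_COMM_WORLD, 2);
        }
    }

  MPI_Barrier(MPI_COMM_WORLD);
  if(rank == 0)
    printf("ring test finished: %d rounds of %d bytes\n", NROUNDS, NBYTES);

  free(sendbuf);
  free(recvbuf);
  MPI_Finalize();
  return 0;
}
```

With most MPI installations this can be built with "mpicc ring_test.c -o
ring_test" and started with "mpirun -np 4 ./ring_test" (the file name is
just an example). Running it once with all 4 tasks on one node and once
with the tasks spread over several nodes, as reported by the host names
in the output, shows whether failures only appear when the network is
involved.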

Volker


Gregory Poole wrote:
>
> Greetings everyone,
>
> I'm having trouble getting Gadget2 to run stably on our cluster here
> at Swinburne. It runs for a few time steps and then crashes (at
> seemingly random times) with the following cryptic error message:
>
> p12_12162: (7355.398438) net_recv failed for fd = 80
> p12_12162: p4_error: net_recv read, errno = : 110
> rm_l_12_12722: (7355.398438) net_send: could not write to fd=5, errno = 32
>
> Our system consists of dual quad-core AMD machines with a Gigabit
> interconnect, running CentOS 5.
>
> I got word from a friend that it may be a stack problem and I tried
> calling the following routine after MPI_Init:
>
> #include <stdio.h>
> #include <sys/resource.h>
>
> /* Raise the soft and hard stack limits to unlimited and report the result. */
> void setstacklim__(void)
> {
>   struct rlimit old_Limit;
>   struct rlimit new_Limit;
>   int old_Limit_grval;
>   int new_Limit_srval;
>   int new_Limit_grval;
>
>   old_Limit_grval = getrlimit(RLIMIT_STACK, &old_Limit);
>   new_Limit.rlim_cur = RLIM_INFINITY;
>   new_Limit.rlim_max = RLIM_INFINITY;
>   new_Limit_srval = setrlimit(RLIMIT_STACK, &new_Limit);
>   new_Limit_grval = getrlimit(RLIMIT_STACK, &new_Limit);
>
>   /* rlim_t is not an int, so cast to long long before printing */
>   printf("\n rvals=(%d,%d,%d) Limits were=(%lld,%lld) and now are (%lld,%lld); "
>          "RLIMIT_INFINITY=%lld\n",
>          old_Limit_grval, new_Limit_srval, new_Limit_grval,
>          (long long) old_Limit.rlim_cur, (long long) old_Limit.rlim_max,
>          (long long) new_Limit.rlim_cur, (long long) new_Limit.rlim_max,
>          (long long) RLIM_INFINITY);
> }
>
> The output from this routine is:
>
> rvals=(0,0,0) Limits were=(10485760,-1) and now are (-1,-1); RLIMIT_INFINITY=-1
>
> This did not fix the problem.
>
> Has anyone encountered problems like this on a system such as ours?
> Any suggestions as to what the solution might be?
>
> Thanks for your time and attention,
>
> ..Greg Poole
Received on 2008-02-18 10:40:54
