Re: Segmentation fault in tree calculation for multiple node runs

From: Ken Osato <ken.osato_at_iap.fr>
Date: Wed, 27 Jan 2021 18:16:30 +0100

Dear Volker,

Thank you for your help.
> At the moment I cannot yet reproduce it, but it smells like it is related to the shared memory allocation.
As you may have already noticed, I suspect the problem is caused by the
MPI library or the compiler. It seems that Gadget-4 has been tested with
OpenMPI, but it might fail with specific versions of MPICH.

There are three programming environments on my cluster (cray, gnu, and
intel, with the compilers crayc++, mpicxx, and mpiicpc, respectively),
but all three builds fail with a segmentation fault.
Strangely, only with the Cray compiler does the segmentation fault occur
at a different point (the GDB log is attached below), but I think this
error is caused by the same problem.
All three environments use the MPI library from the Cray Message Passing
Toolkit (MPT) v7.7.0, which is based on ANL MPICH 3.2, with Cray
Compiling Environment v8.6.5.

Best,
Ken


Core was generated by `./Gadget4 param.txt'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000004e7bbc in tree<gravnode, simparticles, gravpoint_data, foreign_gravpoint_data>::treebuild_construct (this=0x7ffffffd8170) at src/tree/tree.cc:318
318          int index = NodeIndex[i];
(gdb) bt
#0  0x00000000004e7bbc in tree<gravnode, simparticles, gravpoint_data, foreign_gravpoint_data>::treebuild_construct (this=0x7ffffffd8170) at src/tree/tree.cc:318
#1  0x00000000004e2be0 in tree<gravnode, simparticles, gravpoint_data, foreign_gravpoint_data>::treebuild (this=0x7ffffffd8170, ninsert=28129, indexlist=0x0) at src/tree/tree.cc:75
#2  0x00000000004b59e5 in sim::gravity (this=0x7ffffffd7340, timebin=0) at src/gravity/gravity.cc:226
#3  0x00000000004b6864 in sim::compute_grav_accelerations (this=0x7ffffffd7340, timebin=0) at src/gravity/gravity.cc:110
#4  0x00000000004a5110 in sim::do_gravity_step_second_half (this=0x7ffffffd7340) at src/time_integration/kicks.cc:379
#5  0x0000000000424a2a in sim::run (this=0x7ffffffd7340) at src/main/run.cc:149
#6  0x000000000041bc6b in main (argc=2, argv=0x7fffffff6008) at src/main/main.cc:327
(gdb) f 0
#0  0x00000000004e7bbc in tree<gravnode, simparticles, gravpoint_data, foreign_gravpoint_data>::treebuild_construct (this=0x7ffffffd8170) at src/tree/tree.cc:318
318          int index = NodeIndex[i];
(gdb) list
313      Father   = (int *)Mem.mymalloc_movable(&Father, "Father", (MaxPart + NumPartImported) * sizeof(int));
314
315      /* now put in markers ("pseudo" particles) in top-leaf nodes to indicate on which task the branch lies */
316      for(int i = 0; i < D->NTopleaves; i++)
317        {
318          int index = NodeIndex[i];
319
320          if(TreeSharedMem_ThisTask == 0)
321            TopNodes[index].nextnode = MaxPart + MaxNodes + i;
322
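
To tell an out-of-range index apart from a bad pointer, one option would
be to range-check the loop from the listings before it runs. The
following is only a minimal sketch and not part of Gadget-4: the function
name first_bad_topleaf and the parameters index_lo, index_hi and
nextnode_len are hypothetical, and the valid ranges have to be supplied
by the caller, since the listings do not show how TopNodes and Nextnode
are sized.

```cpp
#include <cassert>

// Sketch only: not part of Gadget-4. All names except NodeIndex, NTopleaves
// and MaxPart are hypothetical.
//
// Walks the same indices as the top-leaf loop in the listings without
// touching TopNodes or Nextnode, and returns the first i whose accesses
// would fall outside the caller-supplied ranges, or -1 if all are in range.
//
//   [index_lo, index_hi) : assumed valid range of TopNodes[] indices
//   nextnode_len         : assumed number of elements allocated for Nextnode[]
int first_bad_topleaf(const int *NodeIndex, int NTopleaves, int index_lo, int index_hi, int MaxPart,
                      int nextnode_len)
{
  assert(index_lo <= index_hi);

  if(NTopleaves <= 0)
    return -1; /* the loop body would never execute */

  if(NodeIndex == nullptr)
    return 0; /* the very first read NodeIndex[0] would fault */

  for(int i = 0; i < NTopleaves; i++)
    {
      int index = NodeIndex[i];

      if(index < index_lo || index >= index_hi)
        return i; /* TopNodes[index] would be out of range */

      if(MaxPart + i < 0 || MaxPart + i >= nextnode_len)
        return i; /* Nextnode[MaxPart + i] would be out of range */
    }

  return -1;
}
```

A check like this cannot detect a NodeIndex pointer that is non-null but
does not point to memory mapped on the local node; in that case the read
NodeIndex[i] itself would still fault, which would match the crash at
tree.cc:318.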


On 26/01/2021 17:50, Volker Springel wrote:
> Dear Ken,
>
> Thanks a lot for reporting this problem. At the moment I cannot yet reproduce it, but it smells like it is related to the shared memory allocation.
>
> Could you let me know which MPI library (and which version) you're using? (Are there several MPI libraries on your system that you could try as well?) Which compiler are you using? (In case you don't know, the outputs of "which mpicc", "mpicc -v", and "ldd ./Gadget-4" should give some pointers)
>
> Best,
> Volker
>
>
>
>> On 24. Jan 2021, at 16:25, Ken Osato <ken.osato_at_iap.fr> wrote:
>>
>> Dear Gadget community,
>>
>> I'm working on running dark-matter only cosmological simulations with Gadget-4.
>> When I run the code with the Config.sh and param.txt of the example "DM-L50-N128", it runs perfectly on a single node, but on multiple nodes it fails with a segmentation fault.
>> I have been using L-Gadget-2 on the same cluster and have never encountered such an error.
>> I analyzed the core file, and it shows that the segmentation fault occurs in the tree calculation. I suspect that something goes wrong in the memory allocation when there are multiple shared-memory regions.
>>
>> I've attached the log file from a run with the "DM-L50-N128" example settings on a Cray XC50 with 2 nodes (= 80 cores), and the GDB output follows below. Any help or suggestions are welcome. Thank you.
>>
>> Best regards,
>> Ken Osato
>>
>>
>> /* GDB outputs */
>> Core was generated by `./Gadget4 param.txt'.
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0 0x00000000004b5a5e in tree<gravnode, simparticles, gravpoint_data, foreign_gravpoint_data>::treebuild_construct (this=0x7fffffff3870) at src/tree/tree.cc:324
>> 324 Nextnode[MaxPart + i] = TopNodes[index].sibling;
>> (gdb) bt
>> #0 0x00000000004b5a5e in tree<gravnode, simparticles, gravpoint_data, foreign_gravpoint_data>::treebuild_construct (this=0x7fffffff3870) at src/tree/tree.cc:324
>> #1 0x00000000004b0878 in tree<gravnode, simparticles, gravpoint_data, foreign_gravpoint_data>::treebuild (this=0x7fffffff3870, ninsert=24242, indexlist=0x0) at src/tree/tree.cc:75
>> #2 0x000000000048ac97 in sim::gravity (this=0x7fffffff2a40, timebin=0) at src/gravity/gravity.cc:226
>> #3 0x000000000048b8e5 in sim::compute_grav_accelerations (this=0x7fffffff2a40, timebin=0) at src/gravity/gravity.cc:110
>> #4 0x000000000047f4ea in sim::do_gravity_step_second_half (this=0x7fffffff2a40) at src/time_integration/kicks.cc:379
>> #5 0x000000000041911a in sim::run (this=0x7fffffff2a40) at src/main/run.cc:149
>> #6 0x000000000041631a in main (argc=2, argv=0x7fffffff58f8) at src/main/main.cc:327
>> (gdb) f 0
>> #0 0x00000000004b5a5e in tree<gravnode, simparticles, gravpoint_data, foreign_gravpoint_data>::treebuild_construct (this=0x7fffffff3870) at src/tree/tree.cc:324
>> 324 Nextnode[MaxPart + i] = TopNodes[index].sibling;
>> (gdb) list
>> 319
>> 320 if(TreeSharedMem_ThisTask == 0)
>> 321 TopNodes[index].nextnode = MaxPart + MaxNodes + i;
>> 322
>> 323 /* set nextnode for pseudo-particle (Nextnode exists on all ranks) */
>> 324 Nextnode[MaxPart + i] = TopNodes[index].sibling;
>> 325 }
>> 326
>> 327 point_data *export_Points = (point_data *)Mem.mymalloc("export_Points", NumPartExported * sizeof(point_data));
>> 328
>>
>> --
>> Ken Osato
>> Institut d'Astrophysique de Paris
>> 98bis boulevard Arago, 75014 Paris, France
>> Tel: +33 1 44 32 80 00
>> E-mail: ken.osato_at_iap.fr
>>
>> <DM-L50-N128.log>
>> -----------------------------------------------------------
>>
>> If you wish to unsubscribe from this mailing, send mail to
>> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
>> A web-archive of this mailing list is available here:
>> http://www.mpa-garching.mpg.de/gadget/gadget-list

-- 
Ken Osato
Institut d'Astrophysique de Paris
98bis boulevard Arago, 75014 Paris, France
Tel: +33 1 44 32 80 00
E-mail: ken.osato_at_iap.fr
Received on 2021-01-27 18:16:50
