RE: fof can't work in multiple node

From: HU, Rui <1155168718_at_link.cuhk.edu.hk>
Date: Wed, 28 Sep 2022 14:00:19 +0000

Hi Volker,

Thank you for your advice. I have been testing for a few days. I fixed the problem by changing #SLURM -N 10 -c 32 to #SLURM -n 320. My guess is that I had not activated the correct compiler and MPI library (I loaded the Intel MPI library module but was actually using gcc or something else). Anyway, the bug seems to be fixed.
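
For anyone comparing the two job headers: in Slurm, -N sets the number of nodes, -c the number of CPUs per task, and -n the total number of tasks, so the two forms can produce quite different MPI rank counts depending on site defaults and the launcher. A minimal sketch for printing what a job actually received, placed in the batch script before the MPI launch line, could look like this (the function name show_alloc is only illustrative):

```shell
#!/bin/sh
# Print the allocation as seen from inside a Slurm job.
# The SLURM_* variables are set by Slurm; outside a job (or when the
# matching option was not given) they are reported as "unset".
show_alloc() {
    printf 'nodes=%s tasks=%s cpus_per_task=%s\n' \
        "${SLURM_JOB_NUM_NODES:-unset}" \
        "${SLURM_NTASKS:-unset}" \
        "${SLURM_CPUS_PER_TASK:-unset}"
}

show_alloc
```

If the reported task count differs from the number of MPI ranks the code prints at start-up, the job header and the launcher disagree.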

By the way, I would also like to know whether the NGENIC option can use a glass file. You do mention it in the GADGET-4 manual, but I can't get the glass file to be read properly.

Best,
Rui

-----Original Message-----
From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Sent: 23 September 2022 15:21
To: Gadget General Discussion <gadget-list_at_MPA-Garching.MPG.DE>
Subject: Re: [gadget-list] fof can't work in multiple node


Hi Rui,

Thanks for sending me your detailed setup. I have run your job on our cluster (also using 2 nodes with 32 cores each). This worked fine.

You seem to use the most recent version from the repository based on the git-commit tag, but you have made some (presumably minor?) changes I guess. At least the CAMBNOLOG option is not standard, so I had to disable it, and run with PowerSpectrumType=0, ReNormalizeInputSpectrum=1 instead. But this presumably is unrelated to your problem.

The most likely explanation for your problem that I can offer based on past experience is that you are using a problematic/outdated/buggy MPI library. Which one are you using? Please try a recent version of OpenMPI, which is still the most reliable in my experience.

A more remote possibility is the compiler. Which one are you using? Best to try with gcc, version 9.3 or later. If it's a compiler issue due to the optimizer (-O3), you may want to try -O0 to exclude this possibility.
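
A quick way to see which MPI wrappers and compilers are actually picked up in the job environment is a short shell sketch like the one below. It assumes the common wrapper names mpicc, mpicxx and mpirun; a given cluster's modules may expose different ones, and report_tool is just an illustrative helper name.

```shell
#!/bin/sh
# For each tool, print where it is found on PATH and the first line of
# its version banner, or note that it is missing.
report_tool() {
    tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        printf '%s: %s\n' "$tool" "$(command -v "$tool")"
        "$tool" --version 2>&1 | head -n 1
    else
        printf '%s: not found on PATH\n' "$tool"
    fi
}

for t in mpicc mpicxx mpirun gcc g++; do
    report_tool "$t"
done
```

Running this inside the batch job, after the module loads, shows whether the loaded MPI module and the compiler behind the wrappers match what the code was built with. With Open MPI, "mpicc --showme" prints the underlying compiler command; MPICH-derived libraries such as Intel MPI use "mpicc -show" instead.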

Regards,
Volker


> On 21. Sep 2022, at 16:29, HU, Rui <1155168718_at_link.cuhk.edu.hk> wrote:
>
>
> Hi Volker,
>
> The full log has been emailed to you privately (because it's nearly 3MB).
> Please check your inbox, and perhaps the junk folder as well.
>
> The following are some key points:
>
> Code was compiled with the following settings:
> CAMBNOLOG
> CREATE_GRID
> FOF
> NGENIC=128
> NGENIC_2LPT
> NGENIC_FIX_MODE_AMPLITUDES
> PERIODIC
> PMGRID=256
> POWERSPEC_ON_OUTPUT
> SELFGRAVITY
> SUBFIND
>
> FOF: We shall first compute a group catalog for this snapshot file
> FOF: Begin to compute FoF group catalogue... (presently allocated=8.39948 MB)
> FOF: Comoving linking length: 329.526
> TREE: Full tree construction for all particles. (presently allocated=9.40948 MB)
> FOFTREE: Ngb-tree construction done. took 0.0125011 sec
> <numnodes>=5424.21 NTopnodes=585 NTopleaves=512
> FOF: Start linking particles (presently allocated=9.72656 MB)
> FOF: linking of small cells took 1.68094e-05 sec
> FOF: local links done (took 0.0284756 sec, avg-work=0.0215296, imbalance=1.28485).
> FOF: Marked=225356 out of the 2097152 primaries which are linked
> FOF: begin linking across processors (presently allocated=9.84052 MB)
> Code termination on task=10, function treefind_fof_primary(), file src/fof/fof_findgroups.cc, line 315: unexpected because in the present algorithm we are only allowed walk local branches
>
>
> Hope this could help.
>
> Best,
> Rui
>
>
> -----Original Message-----
> From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
> Sent: 19 September 2022 23:04
> To: Gadget General Discussion <gadget-list_at_MPA-Garching.MPG.DE>
> Subject: Re: [gadget-list] fof can't work in multiple node
>
>
> Hi Rui,
>
> No, the FOF option can be used with multiple nodes. The termination you encountered is odd and should not have happened.
>
> Could you send your complete configuration, ideally the full stdout of this run? Perhaps there is something unusual about your setup that is related to the problem. Without more information I cannot say much.
>
> Best,
> Volker
>
>> On 18. Sep 2022, at 05:07, HU, Rui <1155168718_at_link.cuhk.edu.hk> wrote:
>>
>> Hi all,
>>
>> I am trying to use GADGET-4 on a cluster, and I have activated the FOF option in Config.sh. When I try to use multiple nodes (each node contains one Xeon processor), the simulation terminates with the error:
>> “Code termination on task= XXX, function treefind_fof_primary(), file src/fof/fof_findgroups.cc, line 315: unexpected because in the present algorithm we are only allowed walk local branches.”
>>
>> But it works with only one node. So does that error mean the FOF algorithm can only be used on one node/CPU? Or is there an issue related to running in parallel?
>>
>> Best,
>> Rui
>>
>> -----------------------------------------------------------
>>
>> If you wish to unsubscribe from this mailing, send mail to
>> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe
>> gadget-list A web-archive of this mailing list is available here:
>> http://www.mpa-garching.mpg.de/gadget/gadget-list




Received on 2022-09-28 16:00:42

This archive was generated by hypermail 2.3.0 : 2023-01-10 10:01:33 CET