Re: Domain decomposition dominating computational cost from Leonard Romano on 2021-06-14 (GADGET General Discussion Mailing List)

From: Leonard Romano <leonard.romano_at_tum.de>
Date: Mon, 14 Jun 2021 18:08:59 +0200

Dear Volker,

Thank you for your very helpful comments and the clarification!
I tried again with IMPOSE_PINNING and IMPOSE_PINNING_OVERRIDE_MODE, but
it only marginally improved the situation. Hyperthreading also
doesn't seem to be the problem, because with IMPOSE_PINNING I can see
that there are 44 physical and logical cores available.
I think this rules out most things related to Gadget-4 and points
towards issues with the setup of the machine.
I will check again with my system administrator.

Best regards,
Leonard
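
The physical-versus-logical core comparison mentioned above can also be made independently of Gadget-4. The following is a minimal sketch, assuming an x86 Linux node whose /proc/cpuinfo lists "processor", "physical id" and "core id" fields in blank-line-separated blocks; the function names count_cores and hyperthreading_active are illustrative and not part of Gadget-4.

```python
def count_cores(cpuinfo_text):
    """Return (physical_cores, logical_cores) from /proc/cpuinfo-style text.

    Assumes the x86 Linux layout: one blank-line-separated block per
    logical processor, each carrying 'physical id' and 'core id' fields.
    """
    logical = 0
    physical = set()
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue  # skip anything that is not a processor block
        logical += 1
        # A physical core is identified by its (socket, core) pair.
        physical.add((fields.get("physical id"), fields.get("core id")))
    return len(physical), logical


def hyperthreading_active(cpuinfo_text):
    """True if there are more logical processors than physical cores."""
    physical, logical = count_cores(cpuinfo_text)
    return logical > physical


# Hypothetical sample: one socket, two cores, two hardware threads each.
SAMPLE = """processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1

processor : 2
physical id : 0
core id : 0

processor : 3
physical id : 0
core id : 1
"""
```

On a compute node, passing the contents of /proc/cpuinfo to count_cores should return two equal numbers when hyperthreading is disabled, which would match the 44 physical and logical cores reported with IMPOSE_PINNING.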

On 14.06.21 15:31, Volker Springel wrote:
>
> Dear Leonard,
>
> About the CPU cost of the domain decomposition: The behaviour you found isn't normal, so something odd appears to be going on. I have run your setup and got a fraction of 4.8% down to redshift z=1, for the 2 x 256^3 problem size. (Incidentally, this could be reduced if desired by increasing the setting of the ActivePartFracForNewDomainDecomp parameter.)
>
> The result you got for G4's HEALTHTEST feature is indeed curious. I can confirm that there is no simple swap of 'intra' and 'inter', i.e. "Internode" reports the time for a communication test where only one MPI-rank on each node participates, while "Intranode cube" does one where only MPI-ranks placed on the first node communicate amongst each other.
>
> While it is possible that 'internode' comes out on top of 'intranode' in this test, the intranode should in any case be very fast (a couple of thousand MB/sec) due to the ability of MPI to do shared-memory communication in this case, independent of the communication backplane. The fact that your result shows dismal performance for the intranode test is quite strange; this can easily be behind the performance problems you noticed for the domain decomposition.
>
> Are you perhaps using hyperthreading? The e5-2699a-v4 CPU has 22 physical cores as far as I know, but you mention that you use 44(?) cores per CPU. Hyperthreading could potentially lead to a kind of self-blocking of the MPI library. Other than that, you could also suffer from incorrect pinning in principle, something that you could check and correct with
> IMPOSE_PINNING
> IMPOSE_PINNING_OVERRIDE_MODE
> Or maybe your MPI library for some reason doesn't go through shared memory for intranode communication.
>
> Best regards,
> Volker
>
>> On 13. Jun 2021, at 22:02, Leonard Romano <leonard.romano_at_tum.de> wrote:
>>
>> Dear gadget-list members,
>>
>> In the course of furthering my understanding of the issue I performed a health-test using the available config option provided with G4.
>> Attached is the corresponding log output for a run on a representative subset of the available nodes of the cluster I am working on.
>> Looking at these results, I find it very surprising that the data transmission rate for the intranode cube is so much lower than the one for the internode cube. This makes me wonder whether the words "inter" and "intra" were swapped in the code (I would expect "intra" to mean within a node and "inter" between separate nodes). The resulting full hypercube communication seems very slow too, so overall I think there might be a problem with the machines anyway, but it would be helpful if someone could clear up this issue, so that I can provide accurate information to the system administration.
>>
>> Best regards,
>> Leonard Romano
>>
>> On 09.06.21 17:09, Leonard Romano wrote:
>>> Dear gadget-list members,
>>>
>>> When I am running cosmological simulations with Gadget-4 I notice that the domain decomposition becomes a dominant part of the computational cost.
>>>
>>> I am running a simulation with 2x256³ particles (gas and DM) on 44 Intel Xeon (e5-2699a v4) nodes, and the domain decomposition keeps getting overwhelmingly expensive (20%-50% of the total CPU time). Curiously, with the same settings but only 2x128³ particles, it takes only about 10%.
>>> From a quick glance at the code paper I would have expected the opposite behavior.
>>>
>>> Attached are the config-options I compiled with and my parameters.
>>> I would be very grateful if someone has any suggestions or comments about how to improve or understand this behaviour.
>>>
>>> Best,
>>> Leonard
>>>
>> --
>> ===================================================
>> Leonard Romano, B.Sc.（レオナルド・ロマノ）
>> Physics Department
>> Technical University of Munich (TUM), Germany
>> Theoretical Astrophysics Group
>> Department of Earth and Space Science
>> Graduate School of Science, Osaka University, Japan
>> he / him / his
>> ===================================================
>>
>> <Healthtest_LOG.txt>
>> -----------------------------------------------------------
>>
>> If you wish to unsubscribe from this mailing, send mail to
>> minimalist_at_MPA-Garching.MPG.de with a subject of: unsubscribe gadget-list
>> A web-archive of this mailing list is available here:
>> http://www.mpa-garching.mpg.de/gadget/gadget-list


Received on 2021-06-14 18:09:01