Re: Domain decomposition dominating computational cost

From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Date: Mon, 14 Jun 2021 15:31:08 +0200

Dear Leonard,

About the CPU cost of the domain decomposition: The behaviour you found isn't normal, so something odd appears to be going on. I have run your setup and got a fraction of 4.8% down to redshift z=1, for the 2 x 256^3 problem size. (Incidentally, this could be reduced if desired by increasing the setting of the ActivePartFracForNewDomainDecomp parameter.)
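
To illustrate why a larger ActivePartFracForNewDomainDecomp lowers this cost, here is a minimal sketch. It assumes that a new domain decomposition is triggered only on timesteps where the fraction of active particles exceeds the parameter value. The function names, the active-particle counts, and the two threshold values are hypothetical and are not taken from the Gadget-4 source.

```python
def needs_new_domain_decomposition(n_active, n_total, active_part_frac):
    """Assumed rule: decompose only if the active fraction exceeds the threshold."""
    return n_active > active_part_frac * n_total


def count_decompositions(active_counts, n_total, active_part_frac):
    """Count how many timesteps would trigger a new decomposition."""
    return sum(
        needs_new_domain_decomposition(n, n_total, active_part_frac)
        for n in active_counts
    )


if __name__ == "__main__":
    n_total = 2 * 256**3  # problem size from this thread
    # Hypothetical numbers of active particles on successive timesteps.
    active_counts = [n_total, 5_000, 200_000, 400_000, 50_000]
    for frac in (0.01, 0.02):  # illustrative threshold values only
        print(frac, count_decompositions(active_counts, n_total, frac))
```

With these made-up numbers, the larger threshold skips the decomposition on the step with 400,000 active particles, so fewer decompositions are carried out overall.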

The result you got for G4's HEALTHTEST feature is indeed curious. I can confirm that there is no simple swap of 'intra' and 'inter', i.e. "Internode" reports the time for a communication test where only one MPI-rank on each node participates, while "Intranode cube" does one where only MPI-ranks placed on the first node communicate amongst each other.
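
The difference between the two tests can be sketched as a selection of participating ranks. This is only an illustration of the description above, not the actual HEALTHTEST code. The helper names and the host list are hypothetical, and "first node" is assumed to mean the node hosting rank 0.

```python
def internode_participants(hosts):
    """One rank per node: the first rank encountered on each host."""
    seen = set()
    ranks = []
    for rank, host in enumerate(hosts):
        if host not in seen:
            seen.add(host)
            ranks.append(rank)
    return ranks


def intranode_participants(hosts):
    """All ranks placed on the same node as rank 0 (assumed 'first node')."""
    first_host = hosts[0]
    return [rank for rank, host in enumerate(hosts) if host == first_host]


if __name__ == "__main__":
    # Hypothetical layout: hosts[i] is the node name that rank i runs on.
    hosts = ["node0", "node0", "node1", "node1", "node2", "node2"]
    print("internode test ranks:", internode_participants(hosts))
    print("intranode test ranks:", intranode_participants(hosts))
```

In the internode case every message has to cross the network, while in the intranode case all participants share one node, so the MPI library can use shared memory.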

While it is possible that 'internode' comes out on top of 'intranode' in this test, the intranode test should in any case be very fast (a couple of thousand MB/sec) due to the ability of MPI to do shared-memory communication in this case, independent of the communication backplane. The fact that your result shows dismal performance for the intranode test is quite strange; this could easily be behind the performance problems you noticed for the domain decomposition.

Are you perhaps using hyperthreading? The e5-2699a-v4 CPU has 22 physical cores as far as I know, but you mention that you use 44(?) cores per CPU. Hyperthreading could potentially lead to a kind of self-blocking of the MPI library. Other than that, you could also suffer from incorrect pinning in principle, something that you could check and correct with
Or maybe your MPI library for some reason doesn't go through shared memory for intranode communication.
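
The hyperthreading and pinning questions can be checked on a Linux node with a short script. The following is a sketch under stated assumptions, not the tool referred to above: it assumes an x86 Linux system whose /proc/cpuinfo contains "processor", "physical id" and "core id" fields, and the function name count_cpus is made up for this example.

```python
import os


def count_cpus(cpuinfo_text):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo-style text."""
    logical = 0
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        value = value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_id = value  # socket that the following core id belongs to
        elif key == "core id":
            cores.add((physical_id, value))
    return logical, len(cores)


if __name__ == "__main__":
    if os.path.exists("/proc/cpuinfo"):
        with open("/proc/cpuinfo") as f:
            logical, physical = count_cpus(f.read())
        print("logical CPUs:", logical, "physical cores:", physical)
        if physical and logical > physical:
            print("hyperthreading appears to be enabled")
    if hasattr(os, "sched_getaffinity"):
        # CPUs this process is allowed to run on; under correct pinning,
        # each MPI rank would typically report a small, distinct set.
        print("allowed CPUs:", sorted(os.sched_getaffinity(0)))
```

If the node reports twice as many logical CPUs as physical cores, running one MPI rank per logical CPU means two ranks share each core.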

Best regards,

> On 13. Jun 2021, at 22:02, Leonard Romano <> wrote:
> Dear gadget-list members,
> In the course of furthering my understanding of the issue I performed a health-test using the available config option provided with G4.
> Attached is the corresponding log output for a run, on a representative subset of the available nodes of the cluster I am working on.
> Looking at these results, it seems very surprising to me that the data transmission rate for the intranode cube is so much lower than the one for the internode cube. This makes me wonder if maybe in the code the words "inter" and "intra" were swapped (I would expect "intra" to mean within a node and "inter" between separate nodes). The resulting full hypercube communication seems very slow too, so overall, I think there might be a problem with the machines anyway, but it would be helpful if someone could clear up this issue, so I can provide accurate information to the system administration.
> Best regards,
> Leonard Romano
> On 09.06.21 17:09, Leonard Romano wrote:
>> Dear gadget-list members,
>> When I am running cosmological simulations with Gadget-4 I notice that the domain decomposition becomes a dominant part of the computational cost.
>> I am running a simulation with 256³ gas and DM particles on 44 Intel Xeon (e5-2699a v4) nodes, and the domain decomposition keeps getting overwhelmingly expensive (20%-50% of the total CPU time). Curiously with the same settings but only 2x128³ particles it only takes about 10%.
>> From a quick glance at the code-paper I would have expected the opposite behavior.
>> Attached are the config-options I compiled with and my parameters.
>> I would be very grateful if someone has any suggestions or comments about how to improve or understand this behaviour.
>> Best,
>> Leonard
> --
> ===================================================
> Leonard Romano, B.Sc.(レオナルド・ロマノ)
> Physics Department
> Technical University of Munich (TUM), Germany
> Theoretical Astrophysics Group
> Department of Earth and Space Science
> Graduate School of Science, Osaka University, Japan
> he / him / his
> ===================================================
> <Healthtest_LOG.txt>
Received on 2021-06-14 15:31:09
