Domain decomposition dominating computational cost

Dear gadget-list members,

When running cosmological simulations with Gadget-4, I notice that
the domain decomposition becomes a dominant part of the computational cost.

I am running a simulation with 2x256³ particles (gas and DM) on 44 Intel
Xeon (E5-2699A v4) nodes, and the domain decomposition becomes
overwhelmingly expensive (20%-50% of the total CPU time). Curiously, with
the same settings but only 2x128³ particles, it takes only about 10%.
From a quick glance at the code paper I would have expected the
opposite behaviour.

Attached are the config options I compiled with and my parameters.
I would be very grateful for any suggestions or comments on how to
improve or understand this behaviour.


