Re: Gadget 4 Single/double precision performance

From: Volker Springel <vspringel_at_MPA-Garching.MPG.DE>
Date: Tue, 1 Dec 2020 20:14:19 +0100

Hi Tiago,

When you store particle data in single precision, you can save some memory (but sacrifice a bit of precision). I'm not sure whether this is what you intended to obtain with DOUBLEPRECISION=0... If this was your goal, then you need to refrain from setting DOUBLEPRECISION at all. The settings DOUBLEPRECISION=0 and DOUBLEPRECISION=1 are identical, and both switch on storage of particle data in double precision.

Only DOUBLEPRECISION=2 is different, which creates mixed precision, with some more critical variables in double, others in single.
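
To put the three cases side by side, here is a small Python sketch of the mapping described above. It only restates this message as a lookup and is not code from Gadget-4; the function name `particle_storage_precision` and the returned strings are made up for illustration.

```python
from typing import Optional


def particle_storage_precision(doubleprecision: Optional[int]) -> str:
    """Storage mode of particle data for a given DOUBLEPRECISION setting.

    Pass None when DOUBLEPRECISION is not set at all in the compile-time
    configuration. The return strings are illustrative labels only.
    """
    if doubleprecision is None:
        # Option absent: particle data is stored in single precision.
        return "single"
    if doubleprecision in (0, 1):
        # 0 and 1 behave identically: storage in double precision.
        return "double"
    if doubleprecision == 2:
        # Mixed: more critical variables in double, others in single.
        return "mixed"
    raise ValueError(f"value not covered in this message: {doubleprecision}")


for setting in (None, 0, 1, 2):
    print(setting, "->", particle_storage_precision(setting))
```

Applied to the two configurations quoted below, both DOUBLEPRECISION=0 and DOUBLEPRECISION=1 fall into the "double" case, which is consistent with the identical memory tables.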

Most internal calculations are done in double precision independent of what you use for DOUBLEPRECISION.

If you use USE_SINGLEPRECISION_INTERNALLY, however, this is not the case. In particular, the matrix/vector calculations in FMM are then done consistently in single precision everywhere, avoiding all accidental promotions to double (which are already triggered, for example, by multiplying with a double constant like 0.5). One may hope that this has a positive speed impact in case the compiler can emit vector instructions, which can pack more single-precision operations together. This is why I added this option. In practice, however, this appears hardly ever to be the case in the current code, and for scalar operations, single precision runs at the same speed as double on most current x86 CPUs. In fact, single can even be slower in case one needs extra type conversions from single to double, or from double to single precision... There is no memory impact from USE_SINGLEPRECISION_INTERNALLY. So in essence, you only lose accuracy when using this option, with little if any gain in speed, which makes this option quite unattractive/useless in practice.
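
As a rough illustration of the accuracy side of this trade-off, the following Python sketch accumulates the same sum once with every intermediate result rounded to single precision and once in plain double precision. It is a toy example and has nothing to do with the actual FMM kernels; the helper names `to_f32` and `accumulate` and the choice of step and count are assumptions made for the demonstration.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (an IEEE double) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step: float, n: int, single: bool) -> float:
    """Add `step` to a running total n times.

    With single=True, the step and every partial sum are rounded to single
    precision, mimicking arithmetic that never gets promoted to double.
    """
    if single:
        step = to_f32(step)
    total = 0.0
    for _ in range(n):
        total += step
        if single:
            total = to_f32(total)
    return total


N = 100_000
exact = N / 10  # mathematically exact value of adding 0.1 N times

rel_err_single = abs(accumulate(0.1, N, single=True) - exact) / exact
rel_err_double = abs(accumulate(0.1, N, single=False) - exact) / exact

print(f"relative error, single precision: {rel_err_single:.3e}")
print(f"relative error, double precision: {rel_err_double:.3e}")
```

The single-precision total drifts much further from the exact value than the double-precision one. How much of this kind of error shows up in a real simulation depends on the calculation, so the numbers printed here should not be read as an estimate for Gadget-4.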

When you set LEAN, you can realize a few additional memory savings, likewise when using POSITIONS_IN_32BIT and/or IDS_32BIT.
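
For a feeling of the size of these savings in a 512^3 run, here is a back-of-the-envelope Python estimate. It assumes, purely for the arithmetic, that each particle carries three position coordinates and one ID and that the alternative to the 32-bit variants is 64 bits for each; the actual particle layout in the code is not taken from this message. The numbers are totals over all MPI tasks.

```python
N_PART = 512 ** 3          # particle number of the run discussed here
BYTES_32, BYTES_64 = 4, 8  # assumed widths: 32-bit vs 64-bit storage
MIB = 1024 ** 2

# Three coordinates per particle, each shrinking from 8 to 4 bytes (assumption).
pos_saving = 3 * (BYTES_64 - BYTES_32) * N_PART

# One ID per particle, shrinking from 8 to 4 bytes (assumption).
id_saving = (BYTES_64 - BYTES_32) * N_PART

# 32-bit IDs can label at most 2**32 distinct particles.
ids_fit_32bit = N_PART < 2 ** 32

print(f"particles:            {N_PART}")
print(f"saving on positions:  {pos_saving / MIB:.0f} MiB")
print(f"saving on IDs:        {id_saving / MIB:.0f} MiB")
print(f"32-bit IDs suffice:   {ids_fit_32bit}")
```

Under these assumptions the combined saving is 16 bytes per particle, and 512^3 particles are comfortably within the range of 32-bit IDs.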

Regards,
Volker

> On 1. Dec 2020, at 08:15, Tiago Castro <tiagobscastro_at_gmail.com> wrote:
>
> Dear list,
>
> I am searching for the optimal configuration for running Gadget4. I am running control DMO simulations of 500 Mpc and 512^3 particles. I am puzzled by the following: running the code with/without USE_SINGLEPRECISION_INTERNALLY (config files pasted below) seems to affect neither the execution time nor the memory consumption (memory.txt pasted below). However, I observe a rather small suppression (0.05%) of the matter power spectrum at z=0.0 for modes larger than unity. Is it due to the LEAN configuration? Should the LEAN configuration affect the code accuracy as well? I warmly appreciate any clarification you can provide.
>
> Cheers,
> ---------------------- SINGLE PRECISION --------------------------
> Code was compiled with the following settings:
> ASMTH=1.25
> CREATE_GRID
> DOUBLEPRECISION=0
> FMM
> FOF
> FOF_GROUP_MIN_LEN=100
> FOF_LINKLENGTH=0.2
> FOF_PRIMARY_LINK_TYPES=2
> HIERARCHICAL_GRAVITY
> IMPOSE_PINNING
> LEAN
> MERGERTREE
> MULTIPOLE_ORDER=2
> NGENIC=512
> NGENIC_2LPT
> NSOFTCLASSES=1
> NTAB=128
> NTYPES=6
> OUTPUT_TIMESTEP
> PERIODIC
> PMGRID=512
> POWERSPEC_ON_OUTPUT
> RANDOMIZE_DOMAINCENTER
> RCUT=6.0
> SELFGRAVITY
> SUBFIND
> SUBFIND_HBT
> TREE_NUM_BEFORE_NODESPLIT=4
> USE_SINGLEPRECISION_INTERNALLY
>
> MEMORY: Largest Allocation = 1559.32 Mbyte | Largest Allocation Without Generic = 1201.79 Mbyte
>
> -------------------------- Allocated Memory Blocks---- ( Step 0 )------------------
> Task Nr F Variable MBytes Cumulative Function|File|Linenumber
> ------------------------------------------------------------------------------------------
> 23 0 0 GetGhostRankForSimulCommRank 0.0006 0.0006 mymalloc_init()|src/data/mymalloc.cc|137
> 23 1 0 GetShmRankForSimulCommRank 0.0006 0.0012 mymalloc_init()|src/data/mymalloc.cc|138
> 23 2 0 GetNodeIDForSimulCommRank 0.0006 0.0018 mymalloc_init()|src/data/mymalloc.cc|139
> 23 3 0 SharedMemBaseAddr 0.0003 0.0021 mymalloc_init()|src/data/mymalloc.cc|153
> 23 4 1 slab_to_task 0.0020 0.0041 my_slab_based_fft_init()|src/pm/pm_mpi_fft.cc|45
> 23 5 1 slabs_x_per_task 0.0006 0.0047 my_slab_based_fft_init()|src/pm/pm_mpi_fft.cc|60
> 23 6 1 first_slab_x_of_task 0.0006 0.0053 my_slab_based_fft_init()|src/pm/pm_mpi_fft.cc|63
> 23 7 1 slabs_y_per_task 0.0006 0.0059 my_slab_based_fft_init()|src/pm/pm_mpi_fft.cc|66
> 23 8 1 first_slab_y_of_task 0.0006 0.0065 my_slab_based_fft_init()|src/pm/pm_mpi_fft.cc|69
> 23 9 1 P 175.0443 175.0508 allocate_memory()|src/ngenic/../main/../data/simparticles|273
> 23 10 1 SphP 0.0001 175.0509 allocate_memory()|src/ngenic/../main/../data/simparticles|274
> 23 11 1 FirstTopleafOfTask 0.0006 175.0515 domain_allocate()|src/domain/domain.cc|163
> 23 12 1 NumTopleafOfTask 0.0006 175.0521 domain_allocate()|src/domain/domain.cc|164
> 23 13 1 TopNodes 0.0358 175.0879 domain_allocate()|src/domain/domain.cc|165
> 23 14 1 TaskOfLeaf 0.0156 175.1035 domain_allocate()|src/domain/domain.cc|166
> 23 15 1 ListOfTopleaves 0.0156 175.1191 domain_decomposition()|src/domain/domain.cc|118
> 23 16 1 PS 87.5222 262.6413 create_snapshot_if_desired()|src/main/run.cc|534
> 23 17 0 MinID 3.5000 266.1413 fof_fof()|src/fof/fof.cc|71
> 23 18 0 MinIDTask 3.5000 269.6413 fof_fof()|src/fof/fof.cc|72
> 23 19 0 Head 3.5000 273.1413 fof_fof()|src/fof/fof.cc|73
> 23 20 0 Next 3.5000 276.6413 fof_fof()|src/fof/fof.cc|74
> 23 21 0 Tail 3.5000 280.1413 fof_fof()|src/fof/fof.cc|75
> 23 22 0 Len 3.5000 283.6413 fof_fof()|src/fof/fof.cc|76
> 23 23 1 Send_count 0.0006 283.6419 treeallocate()|src/tree/tree.cc|794
> 23 24 1 Send_offset 0.0006 283.6425 treeallocate()|src/tree/tree.cc|795
> 23 25 1 Recv_count 0.0006 283.6431 treeallocate()|src/tree/tree.cc|796
> 23 26 1 Recv_offset 0.0006 283.6437 treeallocate()|src/tree/tree.cc|797
> 23 27 0 TreeNodes_offsets 0.0003 283.6440 treeallocate()|src/tree/tree.cc|824
> 23 28 0 TreePoints_offsets 0.0003 283.6443 treeallocate()|src/tree/tree.cc|825
> 23 29 0 TreeNextnode_offsets 0.0003 283.6447 treeallocate()|src/tree/tree.cc|826
> 23 30 0 TreeForeign_Nodes_offsets 0.0003 283.6450 treeallocate()|src/tree/tree.cc|827
> 23 31 0 TreeForeign_Points_offsets 0.0003 283.6453 treeallocate()|src/tree/tree.cc|828
> 23 32 0 TreeP_offsets 0.0003 283.6456 treeallocate()|src/tree/tree.cc|829
> 23 33 0 TreeSphP_offsets 0.0003 283.6459 treeallocate()|src/tree/tree.cc|830
> 23 34 0 TreePS_offsets 0.0003 283.6462 treeallocate()|src/tree/tree.cc|831
> 23 35 0 TreeSharedMemBaseAddr 0.0003 283.6465 treeallocate()|src/tree/tree.cc|833
> 23 36 1 Nodes 15.3964 299.0428 treeallocate()|src/tree/tree.cc|882
> 23 37 1 Points 0.0001 299.0429 treebuild_construct()|src/tree/tree.cc|311
> 23 38 1 Nextnode 3.5167 302.5596 treebuild_construct()|src/tree/tree.cc|312
> 23 39 1 Father 3.5010 306.0606 treebuild_construct()|src/tree/tree.cc|313
> 23 40 0 Flags 0.8750 306.9356 fof_find_groups()|src/fof/fof_findgroups.cc|127
> 23 41 0 FullyLinkedNodePIndex 0.5178 307.4534 fof_find_groups()|src/fof/fof_findgroups.cc|129
> 23 42 0 targetlist 3.5000 310.9534 fof_find_groups()|src/fof/fof_findgroups.cc|163
> 23 43 0 Exportflag 0.0006 310.9540 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|593
> 23 44 0 Exportindex 0.0006 310.9546 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|594
> 23 45 0 Exportnodecount 0.0006 310.9552 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|595
> 23 46 0 Send 0.0012 310.9564 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|597
> 23 47 0 Recv 0.0012 310.9576 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|598
> 23 48 0 Send_count 0.0006 310.9583 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|600
> 23 49 0 Send_offset 0.0006 310.9589 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|601
> 23 50 0 Recv_count 0.0006 310.9595 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|602
> 23 51 0 Recv_offset 0.0006 310.9601 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|603
> 23 52 0 Send_count_nodes 0.0006 310.9607 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|605
> 23 53 0 Send_offset_nodes 0.0006 310.9613 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|606
> 23 54 0 Recv_count_nodes 0.0006 310.9619 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|607
> 23 55 0 Recv_offset_nodes 0.0006 310.9625 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|608
> 23 56 1 PartList 1241.0233 1551.9858 src/fof/../mpi_utils/generic_comm.h|198generic_alloc_partlist_nodelist_ngblist()|src/fof/../mpi_utils/generic_comm.h|244
> 23 57 1 Ngblist 3.5000 1555.4858 src/fof/../mpi_utils/generic_comm.h|198generic_alloc_partlist_nodelist_ngblist()|src/fof/../mpi_utils/generic_comm.h|247
> 23 58 1 Shmranklist 3.5000 1558.9858 src/fof/../mpi_utils/generic_comm.h|198generic_alloc_partlist_nodelist_ngblist()|src/fof/../mpi_utils/generic_comm.h|248
> 23 59 1 DataIn 0.0001 1558.9859 src/fof/../mpi_utils/generic_comm.h|198generic_exchange()|src/fof/../mpi_utils/generic_comm.h|556
> 23 61 1 DataOut 0.0001 1558.9860 src/fof/../mpi_utils/generic_comm.h|198generic_exchange()|src/fof/../mpi_utils/generic_comm.h|558
> 23 62 0 rel_node_index 0.0006 1558.9866 src/fof/../mpi_utils/generic_comm.h|198generic_prepare_particle_data_for_expor()|src/fof/../mpi_utils/generic_comm.h|317
> ------------------------------------------------------------------------------------------
>
> ---------------------- DOUBLE PRECISION --------------------------
> Code was compiled with the following settings:
> ASMTH=1.25
> CREATE_GRID
> DOUBLEPRECISION=1
> FMM
> FOF
> FOF_GROUP_MIN_LEN=100
> FOF_LINKLENGTH=0.2
> FOF_PRIMARY_LINK_TYPES=2
> GADGET2_HEADER
> HIERARCHICAL_GRAVITY
> IMPOSE_PINNING
> LEAN
> MERGERTREE
> MULTIPOLE_ORDER=2
> NGENIC=512
> NGENIC_2LPT
> NSOFTCLASSES=1
> NTAB=128
> NTYPES=6
> OUTPUT_TIMESTEP
> PERIODIC
> PMGRID=512
> POWERSPEC_ON_OUTPUT
> RANDOMIZE_DOMAINCENTER
> RCUT=6.0
> SELFGRAVITY
> SUBFIND
> SUBFIND_HBT
> TREE_NUM_BEFORE_NODESPLIT=4
>
> MEMORY: Largest Allocation = 1559.32 Mbyte | Largest Allocation Without Generic = 1202.39 Mbyte
>
> -------------------------- Allocated Memory Blocks---- ( Step 0 )------------------
> Task Nr F Variable MBytes Cumulative Function|File|Linenumber
> ------------------------------------------------------------------------------------------
> 8 0 0 GetGhostRankForSimulCommRank 0.0006 0.0006 mymalloc_init()|src/data/mymalloc.cc|137
> 8 1 0 GetShmRankForSimulCommRank 0.0006 0.0012 mymalloc_init()|src/data/mymalloc.cc|138
> 8 2 0 GetNodeIDForSimulCommRank 0.0006 0.0018 mymalloc_init()|src/data/mymalloc.cc|139
> 8 3 0 SharedMemBaseAddr 0.0003 0.0021 mymalloc_init()|src/data/mymalloc.cc|153
> 8 4 1 slab_to_task 0.0020 0.0041 my_slab_based_fft_init()|src/pm/pm_mpi_fft.cc|45
> 8 5 1 slabs_x_per_task 0.0006 0.0047 my_slab_based_fft_init()|src/pm/pm_mpi_fft.cc|60
> 8 6 1 first_slab_x_of_task 0.0006 0.0053 my_slab_based_fft_init()|src/pm/pm_mpi_fft.cc|63
> 8 7 1 slabs_y_per_task 0.0006 0.0059 my_slab_based_fft_init()|src/pm/pm_mpi_fft.cc|66
> 8 8 1 first_slab_y_of_task 0.0006 0.0065 my_slab_based_fft_init()|src/pm/pm_mpi_fft.cc|69
> 8 9 1 P 175.0443 175.0508 allocate_memory()|src/ngenic/../main/../data/simparticles|273
> 8 10 1 SphP 0.0001 175.0509 allocate_memory()|src/ngenic/../main/../data/simparticles|274
> 8 11 1 FirstTopleafOfTask 0.0006 175.0515 domain_allocate()|src/domain/domain.cc|163
> 8 12 1 NumTopleafOfTask 0.0006 175.0521 domain_allocate()|src/domain/domain.cc|164
> 8 13 1 TopNodes 0.0358 175.0879 domain_allocate()|src/domain/domain.cc|165
> 8 14 1 TaskOfLeaf 0.0156 175.1035 domain_allocate()|src/domain/domain.cc|166
> 8 15 1 ListOfTopleaves 0.0156 175.1191 domain_decomposition()|src/domain/domain.cc|118
> 8 16 1 PS 87.5222 262.6413 create_snapshot_if_desired()|src/main/run.cc|534
> 8 17 0 MinID 3.5000 266.1413 fof_fof()|src/fof/fof.cc|71
> 8 18 0 MinIDTask 3.5000 269.6413 fof_fof()|src/fof/fof.cc|72
> 8 19 0 Head 3.5000 273.1413 fof_fof()|src/fof/fof.cc|73
> 8 20 0 Next 3.5000 276.6413 fof_fof()|src/fof/fof.cc|74
> 8 21 0 Tail 3.5000 280.1413 fof_fof()|src/fof/fof.cc|75
> 8 22 0 Len 3.5000 283.6413 fof_fof()|src/fof/fof.cc|76
> 8 23 1 Send_count 0.0006 283.6419 treeallocate()|src/tree/tree.cc|794
> 8 24 1 Send_offset 0.0006 283.6425 treeallocate()|src/tree/tree.cc|795
> 8 25 1 Recv_count 0.0006 283.6431 treeallocate()|src/tree/tree.cc|796
> 8 26 1 Recv_offset 0.0006 283.6437 treeallocate()|src/tree/tree.cc|797
> 8 27 0 TreeNodes_offsets 0.0003 283.6440 treeallocate()|src/tree/tree.cc|824
> 8 28 0 TreePoints_offsets 0.0003 283.6443 treeallocate()|src/tree/tree.cc|825
> 8 29 0 TreeNextnode_offsets 0.0003 283.6447 treeallocate()|src/tree/tree.cc|826
> 8 30 0 TreeForeign_Nodes_offsets 0.0003 283.6450 treeallocate()|src/tree/tree.cc|827
> 8 31 0 TreeForeign_Points_offsets 0.0003 283.6453 treeallocate()|src/tree/tree.cc|828
> 8 32 0 TreeP_offsets 0.0003 283.6456 treeallocate()|src/tree/tree.cc|829
> 8 33 0 TreeSphP_offsets 0.0003 283.6459 treeallocate()|src/tree/tree.cc|830
> 8 34 0 TreePS_offsets 0.0003 283.6462 treeallocate()|src/tree/tree.cc|831
> 8 35 0 TreeSharedMemBaseAddr 0.0003 283.6465 treeallocate()|src/tree/tree.cc|833
> 8 36 1 Nodes 15.3964 299.0428 treeallocate()|src/tree/tree.cc|882
> 8 37 1 Points 0.0001 299.0429 treebuild_construct()|src/tree/tree.cc|311
> 8 38 1 Nextnode 3.5167 302.5596 treebuild_construct()|src/tree/tree.cc|312
> 8 39 1 Father 3.5010 306.0606 treebuild_construct()|src/tree/tree.cc|313
> 8 40 0 Flags 0.8750 306.9356 fof_find_groups()|src/fof/fof_findgroups.cc|127
> 8 41 0 FullyLinkedNodePIndex 0.5178 307.4534 fof_find_groups()|src/fof/fof_findgroups.cc|129
> 8 42 0 targetlist 3.5000 310.9534 fof_find_groups()|src/fof/fof_findgroups.cc|163
> 8 43 0 Exportflag 0.0006 310.9540 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|593
> 8 44 0 Exportindex 0.0006 310.9546 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|594
> 8 45 0 Exportnodecount 0.0006 310.9552 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|595
> 8 46 0 Send 0.0012 310.9564 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|597
> 8 47 0 Recv 0.0012 310.9576 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|598
> 8 48 0 Send_count 0.0006 310.9583 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|600
> 8 49 0 Send_offset 0.0006 310.9589 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|601
> 8 50 0 Recv_count 0.0006 310.9595 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|602
> 8 51 0 Recv_offset 0.0006 310.9601 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|603
> 8 52 0 Send_count_nodes 0.0006 310.9607 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|605
> 8 53 0 Send_offset_nodes 0.0006 310.9613 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|606
> 8 54 0 Recv_count_nodes 0.0006 310.9619 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|607
> 8 55 0 Recv_offset_nodes 0.0006 310.9625 generic_allocate_comm_tables()|src/fof/../mpi_utils/generic_comm.h|608
> 8 56 1 PartList 1241.0233 1551.9858 src/fof/../mpi_utils/generic_comm.h|198generic_alloc_partlist_nodelist_ngblist()|src/fof/../mpi_utils/generic_comm.h|244
> 8 57 1 Ngblist 3.5000 1555.4858 src/fof/../mpi_utils/generic_comm.h|198generic_alloc_partlist_nodelist_ngblist()|src/fof/../mpi_utils/generic_comm.h|247
> 8 58 1 Shmranklist 3.5000 1558.9858 src/fof/../mpi_utils/generic_comm.h|198generic_alloc_partlist_nodelist_ngblist()|src/fof/../mpi_utils/generic_comm.h|248
> 8 59 1 DataIn 0.0001 1558.9859 src/fof/../mpi_utils/generic_comm.h|198generic_exchange()|src/fof/../mpi_utils/generic_comm.h|556
> 8 60 1 NodeInfoIn 0.0001 1558.9860 src/fof/../mpi_utils/generic_comm.h|198generic_exchange()|src/fof/../mpi_utils/generic_comm.h|557
> 8 61 1 DataOut 0.0001 1558.9860 src/fof/../mpi_utils/generic_comm.h|198generic_exchange()|src/fof/../mpi_utils/generic_comm.h|558
> 8 62 0 rel_node_index 0.0006 1558.9866 src/fof/../mpi_utils/generic_comm.h|198generic_prepare_particle_data_for_expor()|src/fof/../mpi_utils/generic_comm.h|317
> ------------------------------------------------------------------------------------------
>
> Tiago Castro Post Doc, Department of Physics / UNITS / OATS
> Phone: (+39 040 3199 120)
> Mobile: (+39 388 794 1562)
> Email: tiagobscastro_at_gmail.com
> Website: tiagobscastro.com
> Skype: tiagobscastro
> Address: Osservatorio Astronomico di Trieste / Villa Bazzoni
> Via Bazzoni, 2, 34143 Trieste TS
Received on 2020-12-01 20:14:20
