
Performance benchmarks

Introduction

As DSMC is computationally expensive, the performance of the code is thoroughly measured and optimized. Measuring the performance of the code on different GPUs is important for making informed decisions when buying hardware. There are several vendors of GPU hardware (AMD, NVIDIA and Intel), and each vendor offers multiple classes of GPUs with different form factors and price tags. The code runs under Linux and Windows. The available hardware ranges from workstations with one or more GPUs to GPU clusters with multiple GPUs per node. Altogether, there is a huge number of possible configurations in which to run PI-DSMC.

The performance of the code is measured in processed molecules per second. This includes the movement of molecules and the collisions between them. When multiple identical GPUs are available, two types of scaling tests are performed in order to study the parallel efficiency of the code: strong scaling and weak scaling.
Strong scaling measures how efficiently a problem of fixed size can be solved in a shorter time by adding more resources (GPUs). The ideal behaviour in this case is a linear increase of performance with the number of GPUs; equivalently, the time required to obtain the solution is inversely proportional to the number of GPUs.
Weak scaling measures how efficiently a problem with a constant amount of work per GPU can be solved. The ideal here is a constant runtime as the total problem size and the number of GPUs grow together. In an ideal case, for example, a simulation of n * k molecules on n GPUs would take the same time as a simulation of k molecules on a single GPU.
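The two scaling definitions above can be sketched numerically. In the sketch below, the timing dictionaries contain made-up illustrative numbers (not measurements from the cluster); only the efficiency formulas, which follow directly from the definitions above, are the point.

```python
# Hypothetical wall-clock times in seconds, keyed by GPU count.
# These numbers are illustrative only, not measured data.
strong_times = {1: 100.0, 2: 52.0, 4: 28.0}   # fixed total problem size
weak_times = {1: 100.0, 2: 101.5, 4: 104.0}   # fixed work per GPU

def strong_efficiency(times):
    """Strong-scaling parallel efficiency E(n) = T(1) / (n * T(n)).

    Ideal is 1.0: with n GPUs the fixed problem finishes n times faster.
    """
    t1 = times[1]
    return {n: t1 / (n * t) for n, t in times.items()}

def weak_efficiency(times):
    """Weak-scaling efficiency E(n) = T(1) / T(n).

    Ideal is 1.0: runtime stays constant while total work grows with n.
    """
    t1 = times[1]
    return {n: t1 / t for n, t in times.items()}

print(strong_efficiency(strong_times))  # entry for 1 GPU is 1.0 by construction
print(weak_efficiency(weak_times))
```

With the hypothetical numbers above, both efficiencies drop slightly below 1.0 as GPUs are added, which is the typical real-world deviation from the ideal case.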



Scaling tests

Scaling tests were conducted on a cluster consisting of four compute nodes. Each node contains three Radeon Pro Duo cards, i.e. six ASICs per node. The CPU is an Intel Xeon E5-2680 v3, and the nodes are connected via 40 Gb/s InfiniBand. The test case is a resting gas with collisions. The number of particles is chosen such that the memory on each GPU is utilized almost completely. The image below shows the results of the weak scaling test, where the total amount of work increases proportionally with the number of GPUs. The speedup is close to ideal, i.e. a larger problem can be solved in constant time by increasing the number of GPUs proportionally to the problem size.


The image below shows the results of the strong scaling test, where a simulation of fixed size, i.e. a constant number of molecules and time steps, is performed with an increasing number of GPUs. The parallel efficiency decreases with an increasing number of GPUs as the overhead grows.
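One common way to reason about this loss of efficiency is Amdahl's law, which is not referenced in the text above but is offered here as an illustration: if a fraction p of the work parallelizes perfectly while the remainder (including communication overhead) stays serial, the achievable speedup saturates as GPUs are added. The parallel fraction used below is an arbitrary example value.

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup S(n) = 1 / ((1 - p) + p / n)
    for parallel fraction p on n processors (here: GPUs)."""
    return 1.0 / ((1.0 - p) + p / n)

# Illustrative parallel fraction of 99%; efficiency S(n)/n falls as n grows.
for n in (1, 2, 4, 8, 16, 24):
    s = amdahl_speedup(0.99, n)
    print(f"{n:2d} GPUs: speedup {s:5.2f}, efficiency {s / n:.2f}")
```

Even with 99% of the work parallelized, this model predicts a visibly sub-linear speedup at 24 GPUs, consistent with the qualitative trend described above.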



Single GPU benchmarks

The table below shows the performance obtained on various GPUs. The test case is a resting gas, and the number of molecules is chosen to fill the GPU memory almost completely.

Model                         Memory [GB]  Chip                     Performance [1/s]                  OS              Comments
AMD Vega VII                  16           Vega 20 (gfx?)           4.82e8                             Ubuntu 18.04.4  ROCm 2.10 dkms
AMD Radeon Pro Duo (Polaris)  2 x 16       2 x Polaris 10 (gfx803)  2.18e8 (1 ASIC), 3.99e8 (2 ASICs)  Ubuntu 18.04.4  ROCm 2.10 dkms
Radeon R9 390                 8            Hawaii                   2.11e8                             OpenSuse 42.1   fglrx (obsolete)
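As a closing note on the metric, the performance figures above are given in processed molecules per second. A minimal sketch of how such a figure could be derived from a run is shown below; it assumes (an assumption, not stated in the text) that each molecule counts as processed once per time step, covering both its movement and its collision handling.

```python
def molecules_per_second(n_molecules, n_steps, wall_time_s):
    """Benchmark throughput in processed molecules per second.

    Assumes each of the n_molecules is processed once per time step,
    so total work is n_molecules * n_steps over wall_time_s seconds.
    """
    return n_molecules * n_steps / wall_time_s

# Hypothetical run: 10 million molecules, 500 steps, 12.5 s wall time.
print(molecules_per_second(10_000_000, 500, 12.5))
```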