Multi-GPU simulations of the motorbike in OpenFOAM with SpeedIT technology

Vratis Ltd., Wroclaw, Poland

March 28, 2012

1. Objective

OpenFOAM® simulations take a significant amount of time, which leads to higher simulation costs. GPGPU technology has the potential to overcome this problem; however, due to the limited memory of a single GPU card, realistic simulations may not be possible. As a solution we propose the SpeedIT multi-GPU technology, which accelerates the solution of the pressure equation, usually the most time-consuming part of simulations of incompressible flows. We compare the performance of SpeedIT multi-GPU with standard OpenFOAM runs on CPU in various test scenarios, for up to 32 million cells on clusters with up to 16 GPU cards.

2. Methodology

SpeedIT is a library that implements iterative solvers on GPU, using MPI to exchange data between domains. The SpeedIT Plugin to OpenFOAM® was used to call the GPU-accelerated iterative solvers from OpenFOAM, which remained responsible for the decomposition of the case. Preliminary tests for cavity3D with a varying number of cells, performed on the PL-Grid cluster, showed (see Figs. 1-2 and the earlier report) that the technology has the potential to reduce simulation time. The tests performed on the CINECA cluster aimed at solving larger simulations, with geometries of up to 80M cells, as well as at testing more efficient preconditioners, such as AMG. The CINECA PLX cluster was equipped with 548 Intel Xeon E5645 processors and 548 Tesla M2070 cards, each with 6 GB of memory and 448 CUDA cores. The following tests were performed in both multi-CPU and multi-GPU environments for a fixed number of time steps (a minimal sketch of the inter-domain communication pattern is given after the figures below):

  1. Ramp, 80M cells, simpleFoam.
  2. motorBike, 32M cells, simpleFoam.

Figure 1: Acceleration defined as the ratio nGPU vs. nCPU for different cavity3D runs with icoFoam and a diagonal preconditioner.
Figure 2: Acceleration defined as the ratio nGPU vs. nCPU for the AhmedBody and Cabin runs with simpleFoam.
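
As context for the multi-GPU runs below, the following is a minimal CPU-only sketch of the halo-exchange pattern that a distributed iterative solver relies on: each MPI rank owns a block of rows and must receive its neighbours' boundary vector entries before the local matrix-vector product. The 1D Laplacian stencil, the sizes and the message tags are illustrative assumptions, not SpeedIT's actual code.

```cpp
// Halo exchange before a local matrix-vector product (illustrative sketch).
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nLocal = 4;                    // rows owned by this rank
    std::vector<double> x(nLocal + 2, 0.0);  // local vector + 2 halo cells
    std::vector<double> y(nLocal, 0.0);
    for (int i = 1; i <= nLocal; ++i) x[i] = rank * nLocal + i;

    // Exchange halo values with the left/right neighbours (non-blocking).
    MPI_Request reqs[4];
    int nReq = 0;
    if (rank > 0) {
        MPI_Irecv(&x[0], 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &reqs[nReq++]);
        MPI_Isend(&x[1], 1, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD, &reqs[nReq++]);
    }
    if (rank < size - 1) {
        MPI_Irecv(&x[nLocal + 1], 1, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, &reqs[nReq++]);
        MPI_Isend(&x[nLocal], 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD, &reqs[nReq++]);
    }
    MPI_Waitall(nReq, reqs, MPI_STATUSES_IGNORE);

    // Local sparse product: y = A x for a 1D Laplacian stencil (-1, 2, -1).
    for (int i = 1; i <= nLocal; ++i)
        y[i - 1] = -x[i - 1] + 2.0 * x[i] - x[i + 1];

    std::printf("rank %d: y[0] = %g\n", rank, y[0]);
    MPI_Finalize();
    return 0;
}
```

Compiled with mpicxx and run with, e.g., mpirun -np 4, each rank applies the stencil to its own block after the exchange; in the real library the local product runs on the rank's GPU.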


3. Results

The tests performed on the CINECA cluster were inspired by industry. The first case was delivered by a government agency: a ramp with 80M cells. We used the PLX cluster to decompose and mesh the case. Unfortunately, due to technical issues we have not been able to run this simulation yet.

The second test was the standard OpenFOAM motorBike case, modified by SGI so that it had 32 million cells. The tests were performed in multi-GPU and multi-CPU environments. For the time being, the SpeedIT library offers a CG solver with a diagonal preconditioner for the solution of the pressure equation on GPU. For the computations on CPU we used the same CG solver with a diagonal preconditioner, and additionally the GAMG solver, since the latter is the one mostly used in real-life simulations. Results are presented in Fig. 3. As one can see, computation on GPUs can be up to 8 times faster than computation on the same number of processor cores. When compared against the GAMG solver, SpeedIT multi-GPU still provides an acceleration of x1.1-x1.4 (excluding the first time step, the acceleration was x1.5). A minimal sketch of a CG solver with a diagonal preconditioner is given after Fig. 3.

Figure 3: Acceleration defined as the ratio nGPU vs. nCPU for motorBike, with the CG solver (diagonal preconditioner) or the GAMG solver used to solve the pressure equation.
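
To make the method behind these numbers concrete, below is a minimal serial sketch of a CG solver with a diagonal (Jacobi) preconditioner. It is illustrative only, not SpeedIT's actual implementation: a small SPD tridiagonal system stands in for the real pressure matrix, and the GPU and MPI layers are omitted.

```cpp
// Diagonally preconditioned conjugate gradient (serial reference sketch).
#include <vector>
#include <cmath>
#include <cstdio>

using Vec = std::vector<double>;

// y = A x for the 1D Laplacian (-1, 2, -1), an SPD stand-in matrix.
static void matvec(const Vec& x, Vec& y) {
    const int n = (int)x.size();
    for (int i = 0; i < n; ++i) {
        y[i] = 2.0 * x[i];
        if (i > 0)     y[i] -= x[i - 1];
        if (i < n - 1) y[i] -= x[i + 1];
    }
}

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

int main() {
    const int n = 100;
    Vec x(n, 0.0), b(n, 1.0), r = b, z(n), p(n), q(n);
    const Vec diag(n, 2.0);                              // diagonal of A

    for (int i = 0; i < n; ++i) z[i] = r[i] / diag[i];   // z = M^{-1} r
    p = z;
    double rz = dot(r, z);

    for (int it = 0; it < 1000 && std::sqrt(dot(r, r)) > 1e-8; ++it) {
        matvec(p, q);
        const double alpha = rz / dot(p, q);
        for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        for (int i = 0; i < n; ++i) z[i] = r[i] / diag[i];
        const double rzNew = dot(r, z);
        for (int i = 0; i < n; ++i) p[i] = z[i] + (rzNew / rz) * p[i];
        rz = rzNew;
    }
    std::printf("residual = %g\n", std::sqrt(dot(r, r)));
    return 0;
}
```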

4. Acknowledgments

We kindly acknowledge NVIDIA and CINECA for their support in performing the simulations, as well as SGI for providing the test case.

5. Disclaimer

  1. This offering is not approved or endorsed by OpenCFD Limited, the producer of the OpenFOAM software and owner of the OPENFOAM® and OpenCFD® trade marks.
  2. The views and statements expressed in this blog are those of Vratis Ltd. and are not necessarily the views of, or endorsed by, the 3rd parties named in this activity.
  3. OPENFOAM® is a registered trade mark of OpenCFD Limited, the producer of the OpenFOAM software.

Multi-GPU tests in OpenFOAM

Introduction

Usually OpenFOAM® simulations take a significant amount of time, which leads to higher prototyping costs. GPU technology could potentially overcome this problem; however, due to the limited memory of a GPU card, realistic simulations are usually not possible. To overcome this limitation we have implemented a multi-GPU version of SpeedIT, in which we accelerate the Preconditioned Conjugate Gradient (PCG) solver used to calculate the pressure. Indeed, it is not uncommon that solving the pressure equation takes more time than solving the momentum equations.

Here you will find the results of our tests performed with various OpenFOAM® solvers, such as icoFoam and simpleFoam. The test cases were either created by us or provided by our partners, Engys and IconCFD.

Methodology

SpeedIT implements SpMV (sparse matrix-vector multiplication) on multi-GPU systems. The SpeedIT Plugin was used to call SpeedIT from OpenFOAM®. The following tests were performed in both multi-CPU and multi-GPU environments. Since the quality of SpeedIT results was already validated (see our previous reports), we ran the tests for a fixed number of time steps:

  1. cavity3D with a varying number of cells, icoFoam, PCG on both CPU and GPU.
  2. AhmedBody, 2.2M cells, simpleFoam (prepared by Engys).
  3. Cabin, 1.5M cells, simpleFoam (prepared by IconCFD).

The tests were performed on a cluster with Intel Xeon X5650 processors (2.67 GHz) and Tesla M2050 cards with 3 GB of memory and 448 CUDA cores.

Results

The following graphs present the performance of all three cases on multi-CPU and multi-GPU systems. The first column covers the cavity3D case solved with icoFoam, while the second covers the AhmedBody and Cabin tests solved with simpleFoam. Figs. 1-2 present the acceleration as the ratio nCPU vs. nGPU, where n varied from 1 to 16; they show what happens when calculations are moved from a standard cluster to a cluster equipped with GPU cards. Figs. 3-4 present the ratio 1CPU vs. nGPU, i.e. the scenario in which calculations are ported from a PC to a multi-GPU system. Finally, Figs. 5-6 depict two comparisons, namely nCPU vs. 1CPU (cluster) and nGPU vs. 1GPU. From these scaling factors one can conclude whether it is reasonable to add more CPUs or GPUs to the system.

Validation

To validate the results we calculated the precision, defined as the difference between the p, U and phi fields computed on CPU and on GPU, taken over all cells in the geometry (a minimal sketch of this metric is given below).

For example, for cavity3D with 32M cells and 5000 time steps, comparing nGPUs vs. nCPUs for n = 2, 4, 12, 16, 18, the precision was 1e-4, 1e-7 and 1e-14 for p, U and phi, respectively.
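
The following is a minimal sketch of such a comparison. The field values are illustrative stand-ins; in practice p, U and phi would be read from the time directories of the CPU and GPU runs.

```cpp
// Precision of a GPU run, measured against the CPU reference (sketch).
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Maximum absolute difference between a field computed on CPU and the
// same field computed on GPU, taken over all cells.
static double maxAbsDiff(const std::vector<double>& cpu,
                         const std::vector<double>& gpu) {
    double m = 0.0;
    for (std::size_t i = 0; i < cpu.size(); ++i)
        m = std::max(m, std::fabs(cpu[i] - gpu[i]));
    return m;
}

int main() {
    // Stand-in data; a real comparison would load the solved fields here.
    std::vector<double> pCpu = {1.00000, 2.00000, 3.00000};
    std::vector<double> pGpu = {1.00008, 1.99995, 3.00002};
    std::printf("precision(p) = %g\n", maxAbsDiff(pCpu, pGpu));
    return 0;
}
```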

Conclusions

The results lead to the following conclusions:

  1. The bigger the case, the better the acceleration on GPU. The scaling of the 1M and 8M cases in Figs. 1-2 indirectly confirms this hypothesis.
  2. It is reasonable to accelerate OpenFOAM® simulations with GPU cards; even a standard PC equipped with GPU card(s) should perform better.
  3. A cluster equipped with additional GPU cards should also provide acceleration, especially for larger test cases (see Fig. 3 and the 8M case).

Future plans

The pressure equation was solved with PCG and a diagonal preconditioner. Numerous publications and tests show that GAMG converges faster and is the solver usually chosen by OpenFOAM® users. In the future, we plan to replace the diagonal preconditioner with alternative preconditioners that converge faster.

Acknowledgments

We would like to kindly acknowledge Engys Ltd. and IconCFD Ltd. for providing us with the test cases. The implementation of the multi-GPU version of SpeedIT was partly funded by the Green Transfer Programme of the City of Wroclaw, financed by the European Social Fund. This research was supported in part by the PL-Grid Infrastructure.

Contact: sales (at) vratis.com and info (at) vratis.com

Disclaimer

  1. This offering is not approved or endorsed by OpenCFD Limited, the producer of the OpenFOAM software and owner of the OPENFOAM® and OpenCFD® trade marks.
  2. The views and statements expressed in this blog are those of Vratis Ltd. and are not necessarily the views of, or endorsed by, the 3rd parties named in this activity.
  3. OPENFOAM® is a registered trade mark of OpenCFD Limited, the producer of the OpenFOAM software.


Performance of SpMV in CUSPARSE, CUSP and SpeedIT

Introduction

Sparse matrix-vector multiplication (SpMV) is one of the BLAS operations most often used in scientific calculations. In order to show that SpeedIT belongs among the fastest implementations of this routine, we tested SpMV on 23 randomly chosen matrices from the University of Florida Sparse Matrix Collection. Their properties are described in Tab. 1. Tab. 2 and Tab. 3 present the time of SpMV in single and double precision, while Figs. 1-8 present the results in graphical form. Since the performance is strongly affected by the matrix size, we divided the matrices into two groups: small and large. The tests were performed on a Tesla C2050 GPU card from NVIDIA.

SpeedIT supports two matrix formats: CSR and the proprietary CMR format. Either can be easily chosen by the user. A reference implementation of SpMV over the CSR format is sketched below.
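
For readers unfamiliar with the format, here is a minimal CPU reference implementation of SpMV over CSR; the 3x3 matrix is an illustrative example, and the CMR format is proprietary, so only CSR is shown. The GPU kernels compute the same product y = A x.

```cpp
// Reference SpMV over the CSR (compressed sparse row) format.
#include <vector>
#include <cstdio>

struct CsrMatrix {
    int nRows;
    std::vector<int> rowPtr;   // size nRows + 1
    std::vector<int> colIdx;   // size NNZ
    std::vector<double> val;   // size NNZ
};

// y = A x
static void spmv(const CsrMatrix& A, const std::vector<double>& x,
                 std::vector<double>& y) {
    for (int row = 0; row < A.nRows; ++row) {
        double sum = 0.0;
        for (int k = A.rowPtr[row]; k < A.rowPtr[row + 1]; ++k)
            sum += A.val[k] * x[A.colIdx[k]];
        y[row] = sum;
    }
}

int main() {
    // 3x3 example: [[4,1,0],[1,4,1],[0,1,4]]
    CsrMatrix A{3, {0, 2, 5, 7}, {0, 1, 0, 1, 2, 1, 2},
                {4, 1, 1, 4, 1, 1, 4}};
    std::vector<double> x = {1, 2, 3}, y(3);
    spmv(A, x, y);
    std::printf("y = %g %g %g\n", y[0], y[1], y[2]);  // expected: 6 12 14
    return 0;
}
```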

Fig. 1. Time of SpMV in Single Precision for small matrices. The reported time is an average over 1000 runs.

Fig. 2. Time of SpMV in Single Precision for large matrices. The reported time is an average over 1000 runs.

Fig. 3. Time of SpMV in Double Precision for small matrices. The reported time is an average over 1000 runs.

Fig. 4. Time of SpMV in Double Precision for large matrices. The reported time is an average over 1000 runs.

Fig. 5. Speed-up of SpeedIT CMR vs. CUSPARSE and CUSP in Single Precision.

Fig. 6. Speed-up of SpeedIT CSR vs. CUSPARSE and CUSP in Single Precision.

Fig. 7. Speed-up of SpeedIT CMR vs. CUSPARSE and CUSP in Double Precision.

Fig. 8. Speed-up of SpeedIT CSR vs. CUSPARSE and CUSP in Double Precision.

Conclusions

  • The highest speed-up of SpMV implemented in SpeedIT CMR is about 2x vs. CUSPARSE and more than 4x vs. CUSP.
  • The highest speed-up of SpMV implemented in SpeedIT CSR against CUSPARSE and CUSP is about 1.4x.
  • SpeedIT performs better for large matrices (> 100 000 NNZ), and the CMR format is the more efficient of the two.

Appendix

Tab. 1. Properties of the matrices used in the SpMV tests. NNZ and NZ denote the numbers of non-zero and zero elements, respectively. Small matrices are marked in green; the remaining matrices are termed large in the following tests.



Tab. 2. Time of SpMV in Single Precision for CUSPARSE, CUSP and SpeedIT in the two available formats.


Tab. 3. Time of SpMV in Double Precision for CUSPARSE, CUSP and SpeedIT in the two available formats.
