SpeedIT Flow and IBM Power 8

1. Introduction

Today CFD simulations are becoming more and more computationally demanding. In many areas of science and industry there is a need to guarantee short turnaround times and fast time-to-market. Such goals can be fulfilled only with huge investments in hardware and software licenses.
Graphics Processing Units provide completely new possibilities for significant cost savings because simulation time can be reduced on hardware that is often less expensive than server-class CPUs. Almost every PC contains a graphics card that supports either CUDA or OpenCL.
SpeedIT Flow is one of the fastest Computational Fluid Dynamics (CFD) implicit single-phase flow solvers currently available. In contrast to other solutions, the Semi-Implicit Method for Pressure-Linked Equations (SIMPLE) and the Pressure-Implicit with Splitting of Operators (PISO) algorithms have been implemented entirely on the Graphics Processing Unit (GPU). The software is particularly useful for accelerating time-consuming external and internal flow simulations, such as aerodynamic studies in the automotive and aviation industries.
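To illustrate what these algorithms involve, the sketch below shows one outer iteration of SIMPLE pressure-velocity coupling, written in the style of an OpenFOAM segregated solver (cf. simpleFoam). Field creation and case setup are omitted, and the fragment illustrates the algorithm only, not SpeedIT Flow's GPU implementation; PISO follows the same structure but repeats the pressure correction within each time step instead of under-relaxing between outer iterations.

    // Simplified SIMPLE outer loop in OpenFOAM-style C++ (setup omitted).
    // Illustration only; SpeedIT Flow's GPU code is not reproduced here.
    while (simple.loop())
    {
        // 1. Momentum predictor: solve momentum with the pressure field
        //    from the previous outer iteration
        fvVectorMatrix UEqn
        (
            fvm::div(phi, U) - fvm::laplacian(nu, U)
        );
        UEqn.relax();
        solve(UEqn == -fvc::grad(p));

        // 2. Pressure correction: assemble and solve the pressure equation
        //    so that the corrected face fluxes satisfy continuity; this
        //    sparse linear solve dominates the run time
        volScalarField rAU(1.0/UEqn.A());
        volVectorField HbyA(rAU*UEqn.H());
        surfaceScalarField phiHbyA(fvc::flux(HbyA));
        fvScalarMatrix pEqn
        (
            fvm::laplacian(rAU, p) == fvc::div(phiHbyA)
        );
        pEqn.solve();

        // 3. Corrector: update fluxes and velocity with the new pressure,
        //    under-relax, and update the turbulence model
        phi = phiHbyA - pEqn.flux();
        p.relax();
        U = HbyA - rAU*fvc::grad(p);
        turbulence->correct();
    }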
SpeedIT Flow's robust solver technology provides double-precision accuracy on fully unstructured meshes of up to 11 million cells. Our implementation was validated on standard steady and unsteady industry-relevant problems with a RANS turbulence model. Its computational efficiency has been evaluated against modern CPU configurations running OpenFOAM.
This technology may be particularly interesting for HPC centres, because our software offers better utilization of their resources. Computations may be done on CPUs and GPUs concurrently. If there are multiple GPUs in the system, independent computing tasks, such as those in parameter sweep studies, can be solved simultaneously. When cases are solved on the GPU, the CPU resources are free and can be used for other tasks such as pre- and post-processing. Moreover, the power efficiency per simulation, which is an important factor in high performance computing, is comparable for a dual-socket multicore CPU and a GPU.
Innovation also comes from CPU manufacturers, for example IBM with its newest hardware. POWER8 is a family of superscalar symmetric multiprocessors based on the Power Architecture. POWER8 is designed as a massively multithreaded chip, with each of its cores capable of handling eight hardware threads simultaneously, for a total of 96 threads executed simultaneously on a 12-core chip. The processor makes use of very large amounts of on- and off-chip eDRAM cache, and on-chip memory controllers enable very high bandwidth to memory and system I/O.
In this article we show the performance of these two solutions on typical CFD studies and compare the results to those obtained with a supercomputer at the Ohio Supercomputer Center (OSC).

2. Hardware

Tests were done in three locations on different hardware:

1) IBM Power Acceleration and Design Center Boeblingen (PADC)

  • machine: IBM Power System S824L (POWER8 processor, 10 cores, 3.4 GHz),
  • OS: Ubuntu 14.04.

2) OSC supercomputer Ruby (the full specification can be found at https://www.osc.edu/supercomputing/hpc)

Figure 1: Test cases. From top left: AeroCar and SolarCar – geometries from 4-ID Network; motorBike – geometry from OpenFOAM tutorial; DrivAer – geometry from Institute of Aerodynamics and Fluid Mechanics at TUM. All geometry providers are kindly acknowledged.

3) Vratis in-house cluster

  • GPU: NVIDIA Quadro K6000,
  • OS: Ubuntu 12.04.4.

3. Test cases

To test the performance of these systems, we selected typical cases for both stationary and non-stationary flows. Simulations were done on the hardware described in Section 2. Simulations run on CPUs used the OpenFOAM software. On the Ruby cluster, simulations were run on two processors (20 cores) per node. On the IBM machine, computations were run on a single processor. POWER8 processors can run up to eight hardware threads on each core; for OpenFOAM the best results were obtained with four threads per core. As the processor used had ten cores, the case was run with forty processes. Simulations run on the GPU used SpeedIT Flow.
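For illustration, on the POWER8 machine a setup of this kind amounts to a standard OpenFOAM domain decomposition with one subdomain per hardware thread (on Linux on Power the SMT mode can typically be adjusted with the ppc64_cpu utility). The dictionary below is an assumed, generic example, not the exact configuration used in the tests:

    // system/decomposeParDict: split the mesh into 40 subdomains, matching
    // 10 cores x 4 SMT threads on the POWER8 processor (assumed example)
    numberOfSubdomains  40;
    method              scotch;

    // A typical launch would then be:
    //   decomposePar
    //   mpirun -np 40 simpleFoam -parallel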

3.1 Stationary flow

Four aerodynamic studies were selected:

  • aeroCar – case with 3.1M cells,
  • solarCar – case with 3.7M cells,
  • motorBike – case with 6.5M cells,
  • DrivAer – case with 10.2M cells.
Figure 2: Simulation times

All cases are shown in Figure 1. A standard OpenFOAM solver configuration was used together with the k-omega SST turbulence model. Simulations were run for a prescribed number of time steps to compare the time-to-solution across the different architectures.
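For reference, in a typical OpenFOAM case of this kind the turbulence model is selected in constant/turbulenceProperties (constant/RASProperties in older releases); the dictionary below is a generic example rather than the exact configuration used in these studies:

    simulationType  RAS;

    RAS
    {
        // k-omega SST RANS model used for the aerodynamic cases
        RASModel        kOmegaSST;
        turbulence      on;
        printCoeffs     on;
    }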
The results for the stationary cases are shown in Figure 2. For these cases the SpeedIT Flow times-to-solution are comparable to computations on a single node of the Ruby cluster. The computation times for the IBM processor and for two nodes of the Ruby cluster are comparable to each other and about two times shorter.

3.2 Non-stationary flow

The stationary cases used a high relative tolerance, so the number of linear solver iterations was quite low. Since the linear solvers are precisely the part that SpeedIT Flow accelerates, we prepared an additional test. For the non-stationary case, blood flow through the left coronary artery (LCA) was chosen. The geometry is shown in Figure 3.
Again, a standard OpenFOAM solver configuration was used. In this example the inlet blood velocity varies in time, as it would in a real flow. As the velocities are small, no turbulence model was used. Simulations were run for a prescribed number of time steps to compare the time-to-solution for the different architectures.
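A time-dependent inlet of this kind can be prescribed in OpenFOAM directly in the 0/U file; the snippet below is a generic example with placeholder values, not the actual waveform used for the LCA case:

    inlet
    {
        // Pulsatile inflow given as a (time, velocity) table; values are
        // interpolated between entries. Numbers here are placeholders.
        type            uniformFixedValue;
        uniformValue    table
        (
            (0.0  (0.10 0 0))
            (0.2  (0.35 0 0))
            (0.4  (0.15 0 0))
            (0.8  (0.10 0 0))
        );
    }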
The results for the non-stationary case are shown in Figure 4. The SpeedIT Flow times are comparable to computations on one socket of the IBM machine and on two nodes of the Ruby cluster. Moreover, the higher the number of pressure-equation solver iterations, the higher the acceleration that can be obtained using SpeedIT Flow.

Figure 3: Artery geometry.
Figure 4: Left – simulation times; right – acceleration of SpeedIT Flow with respect to IBM (dotted) and number of iterations of the pressure equation solver (line).

4. Summary

SpeedIT Flow is a new solver that completely changes the paradigm of running CFD calculations. It gives the end-user an alternative way to reduce turnaround times. By taking advantage of GPUs, which are available in most systems, simulation time can be reduced, bringing significant cost savings to the production pipeline. Finally, flexible licensing that depends on the number of GPUs rather than the number of CPU cores reduces software licensing costs.
As shown in the tests, SpeedIT Flow computation times on a single GPU are comparable to those obtained using up to forty threads on modern CPUs. The advantage can be even greater when the number of linear solver iterations grows, for example for larger non-stationary simulation cases.
SpeedIT Flow is an attractive solution for both individual users and HPC centers. It can also be an alternative to new hardware investments. As computations on GPUs are comparable to those on modern CPUs, instead of buying a new CPU-based system the end-user could simply equip an existing one with GPUs. This solution should be cheaper than, and as effective as, new hardware.
With our software, resource providers such as OSC as well as private or public cloud providers can utilize their hardware more efficiently. On a cluster equipped with GPUs, CFD simulations could be run on CPUs and GPUs at the same time. For example, on a Ruby node with two GPUs and two ten-core CPUs, three simulations could be run: two on the GPUs and one on the remaining eighteen cores. As shown in our tests, the turnaround times are comparable. For the IBM machine the best setup would be slightly different: two simulations could be run with 39 threads on each CPU, and two simulations on the GPUs served by the remaining CPU threads.

Acknowledgements


We would like to thank IBM Germany for providing access to the cluster at the IBM Power Acceleration and Design Center Boeblingen and for consulting on the results. Performance tests on the Ruby cluster were done during the Vratis-OSC-UberCloud experiment. Details of this experiment can be found here.

 


This offering is not approved or endorsed by OpenCFD Limited, the producer of the OpenFOAM software and owner of the OPENFOAM® and OpenCFD® trademarks.