SpeedIT Flow and IBM POWER8

1. Introduction

Today, CFD simulations are becoming increasingly computationally demanding. In many areas of science and industry there is a need to guarantee short turnaround times and a fast time-to-market. Traditionally, such goals could be fulfilled only with large investments in hardware and software licenses.
Graphics Processing Units (GPUs) provide completely new possibilities for significant cost savings, because simulation time can be reduced on hardware that is often less expensive than server-class CPUs. Almost every PC contains a graphics card that supports either CUDA or OpenCL.
SpeedIT Flow is one of the fastest Computational Fluid Dynamics (CFD) implicit single-phase flow solvers currently available. In contrast to other solutions, both the Semi-Implicit Method for Pressure Linked Equations (SIMPLE) and the Pressure Implicit with Operator Splitting (PISO) algorithms have been implemented entirely on the GPU. The software is particularly useful for accelerating time-consuming external and internal flow simulations, such as aerodynamic studies in the automotive and aviation industries.
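To make the algorithmic structure concrete, the listing below is a structural sketch of our own (not SpeedIT Flow source code) of the SIMPLE outer loop, with the discretisation and the linear solves stubbed out. In a fully GPU-resident implementation such as the one described above, every step of this loop operates on device-side fields, so no host-device transfers are needed inside the loop.

```cpp
// Structural sketch of the SIMPLE outer loop (illustration only).
// The stubs stand in for assembling and solving the discretised equations,
// which a GPU-resident solver performs entirely in device memory.
#include <vector>

using Field = std::vector<double>;   // placeholder for a cell-centred field

Field solveMomentum(const Field& U, const Field& /*p*/)     { return U; }
Field solvePressureCorrection(const Field& Ustar)           { return Field(Ustar.size(), 0.0); }
void  correctFields(Field& /*U*/, Field& /*p*/, const Field& /*pPrime*/) {}

int main() {
    Field U(100, 0.0), p(100, 0.0);                // toy-sized velocity and pressure
    const int nOuter = 500;                        // e.g. a prescribed number of outer steps
    for (int iter = 0; iter < nOuter; ++iter) {
        Field Ustar  = solveMomentum(U, p);            // momentum predictor
        Field pPrime = solvePressureCorrection(Ustar); // pressure-correction equation
        correctFields(U, p, pPrime);                   // update velocity and pressure
    }
    return 0;
}
```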
SpeedIT Flow's robust solver technology provides double-precision accuracy on fully unstructured meshes of up to 11 million cells. Our implementation was validated on standard steady and unsteady industry-relevant problems with a RANS turbulence model. Its computational efficiency has been evaluated against modern CPU configurations running OpenFOAM.
This technology may be particularly interesting for HPC centers, because our software offers better utilization of their resources. The computations may be done on the CPUs and GPUs concurrently. If there are multiple GPUs in the system, independent computing tasks, such as those in parameter-sweep studies, can be solved simultaneously. When cases are solved on the GPU, the CPU resources remain free and can be used for other tasks such as pre- and post-processing. Moreover, the power efficiency per simulation, which is an important factor in high performance computing, is comparable for a dual-socket multicore CPU and a GPU.
Innovation also comes from CPU manufacturers such as IBM with their newest hardware. POWER8 is a family of superscalar symmetric multiprocessors based on the Power Architecture. POWER8 is designed as a massively multithreaded chip, with each of its cores capable of handling eight hardware threads simultaneously, for a total of 96 threads executed simultaneously on a 12-core chip. The processor makes use of very large amounts of on- and off-chip eDRAM caches, and on-chip memory controllers enable very high bandwidth to memory and system I/O.
In this article we show the performance of these two solutions on typical CFD studies and compare the results to the ones obtained on a supercomputer at the Ohio Supercomputer Center (OSC).

2. Hardware

Tests were done at three locations on different hardware:

1) IBM Power Acceleration and Design Center Boeblingen (PADC)

  • system: IBM Power System S824L, POWER8, 10 cores, 3.4 GHz,
  • OS: Ubuntu 14.04.

2) OSC supercomputer Ruby (the full specification can be found at https://www.osc.edu/supercomputing/hpc)

Figure 1: Test cases. From top left: AeroCar and SolarCar – geometries from 4-ID Network; motorBike – geometry from OpenFOAM tutorial; DrivAer – geometry from Institute of Aerodynamics and Fluid Mechanics at TUM. All geometry providers are kindly acknowledged.

3) Vratis in-house cluster

  • GPU: NVIDIA Quadro K6000,
  • OS: Ubuntu 12.04.4.

3. Test cases

To test the performance of these systems we selected typical cases of both stationary and non-stationary flows. Simulations were done on the hardware described in Section 2. Simulations run on CPUs used the OpenFOAM software. On the Ruby cluster, simulations were run on two processors (20 cores) per node. On the IBM cluster, computations were run on a single processor. POWER8 processors are capable of running up to eight threads per core; for OpenFOAM the best results are obtained with four threads per core. As the processor used had ten cores, the case was run with forty processes. Simulations run on the GPU used SpeedIT Flow.

3.1 Stationary flow

Four aerodynamic studies were selected:

  • aeroCar – case with 3.1M cells,
  • solarCar – case with 3.7M cells,
  • motorBike – case with 6.5M cells,
  • DrivAer – case with 10.2M cells.
Figure 2: Simulation times

All cases are shown in Figure 1. A standard OpenFOAM solver configuration was used together with the k-omega SST turbulence model. Simulations were run for a prescribed number of time steps to compare the time-to-solution across architectures.
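For readers less familiar with OpenFOAM, the fragment below shows the kind of fvSolution excerpt such a standard configuration contains. The values are illustrative assumptions of ours, not the exact benchmark settings; the relTol entries set the relative tolerance of the linear solvers and therefore how many iterations are performed per outer step, which becomes relevant for the comparison in Section 3.2.

```
solvers
{
    p
    {
        solver          GAMG;          // multigrid solver for the pressure equation
        smoother        GaussSeidel;
        tolerance       1e-07;
        relTol          0.1;           // loose relative tolerance -> few iterations per step
    }
    U
    {
        solver          smoothSolver;
        smoother        GaussSeidel;
        tolerance       1e-08;
        relTol          0.1;
    }
}
```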
The results for the stationary cases are shown in Figure 2. For these cases the SpeedIT Flow times-to-solution are comparable to computations on a single node of the Ruby cluster. The computation times for the IBM processor and for two nodes of the Ruby cluster are comparable to each other and about two times shorter.

3.2 Non-stationary flow

The stationary cases used a high relative tolerance, so the number of linear solver iterations was quite low. Since the linear solvers are exactly the part that SpeedIT Flow accelerates most, we prepared another test. As the non-stationary flow case, blood flow through the LCA was chosen. The geometry used can be seen in Figure 3.
Again, a standard OpenFOAM solver configuration was used. In this example the inlet blood flow velocity changes in time, as it would in a real-life flow. As the velocities are small enough, no turbulence model was used. Simulations were run for a prescribed number of time steps to compare the time-to-solution across architectures.
The results for the non-stationary case are shown in Figure 4. For this case the SpeedIT Flow times are comparable to computations on one socket of the IBM cluster and on two nodes of the Ruby cluster. Moreover, the higher the number of iterations of the pressure equation solver, the higher the acceleration that can be obtained with SpeedIT Flow.

Figure 3: Artery geometry.
Figure 4: Left – simulation times; right – acceleration of SpeedIT Flow with respect to IBM (dotted) and the number of iterations of the pressure equation solver (line)

4. Summary

SpeedIT Flow is a new solver that completely changes the paradigm of running CFD calculations. It gives the end-user an alternative way to reduce turnaround times. By taking advantage of GPUs, which are available in most systems, the simulation time can be reduced, bringing significant cost savings to the production pipeline. Finally, flexible licensing that depends on the number of GPUs rather than on the number of CPU cores reduces software licensing costs.
As shown in the tests, SpeedIT Flow computation times on a single GPU are comparable to the ones obtained using up to forty threads on modern CPUs. This advantage grows when the number of linear solver iterations is higher, for example for larger non-stationary simulation cases.
SpeedIT Flow is an attractive solution for both individual users and HPC centers. It can also be an alternative to new hardware investments. Since computations on GPUs are comparable to the ones on modern CPUs, instead of buying a new CPU-based system the end-user could simply equip an existing one with GPUs. This solution should be cheaper and as effective as new hardware.
With our software, resource providers such as OSC and private or public cloud providers can utilize their hardware more efficiently. On a cluster equipped with GPUs, CFD simulations could be run on CPUs and GPUs at the same time. For example, on a Ruby node with two GPUs and two ten-core CPUs, three simulations could be run: two on the GPUs and one on the eighteen otherwise unused cores. As shown in our tests, the turnaround times are comparable. For the IBM cluster the best setup would be slightly different: two simulations could be run with 39 threads on each CPU, and two simulations on GPUs served by the remaining CPU threads.

Acknowledgements


We would like to thank IBM Germany for providing access to the cluster at the IBM Power Acceleration and Design Center Boeblingen and for consulting on the results. Performance tests on the Ruby cluster were done during the Vratis-OSC-UberCloud experiment. Details of this experiment can be found here.

 


This offering is not approved or endorsed by OpenCFD Limited, the producer of the OpenFOAM software and owner of the OPENFOAM® and OpenCFD® trademarks.

SpeedIT Flow in the UberCloud Marketplace

Together with the Ohio Supercomputer Center (OSC) and the UberCloud we have prepared a preconfigured, tested and validated environment where SpeedIT Flow is ready to be used. In this sales model the customer, instead of buying a license, pays only for actual consumption. This model may be particularly attractive for small companies and research groups with limited budgets, and for bigger companies that want to reduce their simulation costs.

This technology may be particularly interesting for HPC centers, such as OSC, because our software offers better utilization of their resources. The computations may be done on the CPUs and GPUs concurrently. Moreover, the power efficiency per simulation, which is an important factor in high performance computing, is comparable for a dual-socket multicore CPU and a GPU.

Four OpenFOAM test cases were run on two different clusters at OSC: Oakley with Intel Xeon X5650 processors and Ruby with Intel Xeon E5-2670 v2 processors. The results were compared to SpeedIT Flow, which was run on Ruby using an NVIDIA Tesla K40 GPU.

The scaling results showed that SpeedIT Flow is capable of running CFD simulations on a single GPU in times comparable to those obtained using 16-20 cores of a modern server-class CPU. The electric energy consumption per simulation is also comparable to that needed by computations on multicore CPUs.

SpeedIT Flow gives the end-user an alternative way to reduce turnaround times. By taking advantage of GPUs, which are available in most systems, the simulation time can be reduced, bringing significant cost savings to the production pipeline. Finally, flexible licensing that depends on the number of GPUs rather than on the number of CPU cores reduces software licensing costs.

With our software, resource providers such as OSC and private or public cloud providers can utilize their hardware more efficiently. On a cluster with GPUs, CFD simulations can be run on CPUs and GPUs at the same time. For example, on a node with two GPUs and two ten-core CPUs, three simulations could be run: two on the GPUs and one on the eighteen otherwise unused cores. As shown in our tests, the turnaround times and the power consumption per simulation are comparable.

Read the full case study here.

Figure 1: Left – visualization of the DrivAer case; right – scaling results for the DrivAer case; the turnaround time for SpeedIT Flow is comparable to that on 16-20 cores of modern CPUs.
Figure 2: Number of simulations per day (left) and per day per Watt (right) of the DrivAer test case computed on a single node of Oakley (12 cores of Intel Xeon X5650) and Ruby (20 cores of Intel Xeon E5-2670 v2) using OpenFOAM and a single GPU (NVIDIA Tesla K40) using SpeedIT Flow.

Higher productivity of single-phase flow simulations thanks to GPU acceleration

NACA 2412 wing test case

Case setup

A flow over a simple wing with a NACA 2412 airfoil was simulated. The wing's parameters were defined as follows (a short consistency check of these numbers follows the list):

  • chord – c = 1 m,
  • taper ratio – c_tip/c_root = 1,
  • wing span – b = 6 m,
  • wing area – S = 6 m².
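As a quick consistency check (our addition, derived from the numbers above): with a taper ratio of 1 the planform area is S = b · c = 6 m · 1 m = 6 m², giving an aspect ratio AR = b²/S = 36/6 = 6; with the symmetry condition only the half-wing with span b/2 = 3 m and area S/2 = 3 m² is actually meshed.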

A symmetry boundary condition was used, so flow over only half of the wing was simulated. The computational domain was 21 m x 10 m x 12 m. The inlet boundary had a half-cylinder shape so that different angles of attack could be simulated without changes to the mesh. The mesh had 3,401,338 (3.4M) cells.

Figure 1. Mesh visualization
Figure 2. Zoomed part of the mesh where the wing connects to the symmetry plane

Simulations were run for angles of attack from 0 to 45 degrees in 5-degree steps. In each simulation, 500 steps of the SIMPLE algorithm were performed with both OpenFOAM and SpeedIT Flow.

The following boundary conditions were used (an illustrative OpenFOAM excerpt is sketched after the list):

  • Fixed velocity value on the inlet and lower boundaries,
  • Zero gradient pressure on the outlet and upper boundaries,
  • Slip condition on the side boundaries.
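The excerpt below sketches how these conditions would typically appear in the OpenFOAM 0/U and 0/p files. It is an illustration of ours; the patch names and the inlet speed are assumptions, not values taken from the case.

```
// 0/U (velocity) -- illustrative, patch names and inlet speed assumed
boundaryField
{
    inlet    { type fixedValue;   value uniform (30 0 0); } // assumed inlet speed
    lower    { type fixedValue;   value uniform (30 0 0); }
    outlet   { type zeroGradient; }
    upper    { type zeroGradient; }
    sides    { type slip; }
    wing     { type fixedValue;   value uniform (0 0 0); }  // no-slip wall (assumed)
    symmetry { type symmetryPlane; }
}

// 0/p (kinematic pressure) -- illustrative
boundaryField
{
    inlet    { type zeroGradient; }
    lower    { type zeroGradient; }
    outlet   { type fixedValue;   value uniform 0; }
    upper    { type fixedValue;   value uniform 0; }
    sides    { type slip; }
    wing     { type zeroGradient; }
    symmetry { type symmetryPlane; }
}
```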

The following numerical schemes were used (an illustrative fvSchemes fragment follows the list):

  • Gauss linear for gradient,
  • Gauss upwind for divergence,
  • Gauss linear corrected for laplacian,
  • Linear interpolation.
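An illustrative fvSchemes fragment of ours expressing these choices (the ddtSchemes and snGradSchemes entries are assumptions implied by the steady SIMPLE runs and the corrected Laplacian scheme):

```
ddtSchemes           { default steadyState; }    // assumed: steady-state SIMPLE runs
gradSchemes          { default Gauss linear; }
divSchemes           { default Gauss upwind; }   // convective terms; often set per field in practice
laplacianSchemes     { default Gauss linear corrected; }
interpolationSchemes { default linear; }
snGradSchemes        { default corrected; }      // implied by the corrected Laplacian scheme
```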

Results

Results from OpenFOAM (OF) and SpeedIT Flow (SITF) were compared. Fig. 3 shows a comparison of the lift and drag coefficients for different angles of attack computed with OpenFOAM and SpeedIT Flow. The coefficients are nearly identical up to an angle of 35 degrees. The small differences at higher angles may be caused by the flow separation behind the wing. The flow over the wing is shown in Fig. 4; for a 40-degree angle of attack a large recirculation zone can be seen. Fig. 5 shows the lift-to-drag ratio. The results obtained with OpenFOAM and SpeedIT Flow are nearly identical over the whole range of investigated angles of attack.
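For reference, the coefficients plotted in Fig. 3 and Fig. 5 follow the standard definitions (our addition, not specific to either solver):

C_L = L / (0.5 · ρ · U∞² · S_ref),   C_D = D / (0.5 · ρ · U∞² · S_ref),

where L and D are the lift and drag forces, ρ the fluid density, U∞ the free-stream velocity and S_ref the reference area (the wing area defined in the case setup).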

Figure 3. Lift and drag coefficients computed by OpenFOAM (OF) and SpeedIT Flow (SITF) after 500 steps of the SIMPLE algorithm for different angles of attack
Figure 4. Visualization of the pressure field on the symmetry plane and on the wing, and streamlines colored by velocity magnitude, for different angles of attack: left – 0 deg., center – 20 deg., right – 40 deg.
Figure 5. Lift to drag ratio calculated with OpenFOAM (OF) and SpeedIT Flow (SITF) after 500 steps with SIMPLE algorithm for different angles of attack


Acceleration

For the solution of the NACA 2412 wing case we used the following hardware:

  • CPU: 2x Intel(R) Xeon(R) E5649 @ 2.53 GHz (24 threads),
  • GPU: NVIDIA Quadro K6000, 12 GB RAM,
  • RAM: 96 GB,
  • OS: Ubuntu 12.04.4 LTS 64-bit.

Times to solution and accelerations for each angle of attack are given in Fig. 6. The total simulation times are:

  • OF: 26965 s,
  • SITF: 7722 s.

This gives an overall acceleration of about 3.5x.

Figure 6. Comparisons of time to solution for calculations done with OpenFOAM (OF) and SpeedIT Flow (SITF)

Validation

Comparison of numerical results obtained with OpenFOAM and SpeedIT Flow is shown in Fig. 7. For lower angles of attack there is good agreement between the results. For the 40-degree angle of attack there are some differences caused by the onset of flow separation.

Figure 7. Comparison of numerical results obtained with OpenFOAM (line) and SpeedIT Flow (dots). Upper row – results on a vertical line 1 m behind the wing, lower row – pressure distribution along the wing section, for different angles of attack: left – 0 deg., center – 20 deg., right – 40 deg.

Open Source clSPARSE Beta Released

We are happy to announce that we are open sourcing SpeedIT, our first commercial product for sparse linear algebra in OpenCL. We believe this decision will be beneficial for our customers, academia, the community and the HPC market.

Our most efficient kernels from SpeedIT will now be integrated into clSPARSE, an open source OpenCL™ sparse linear algebra library, created in partnership with AMD.

clSPARSE is the fourth library addition to clMathLibraries. It expands upon the dense clBLAS (Basic Linear Algebra Subprograms), clFFT (Fast Fourier Transform) and clRNG (random number generator) offerings already available.

The source is released under the Apache license as a Beta release. We release this to the public to receive feedback, comments and constructive criticism, which may all be filed as Github issues in the repository’s ticketing system. All of our current issues are open to the public to view and comment on. As a Beta release, we reserve the right to tinker with the API and make changes, all depending on constructive feedback we receive.

At the first release, clSPARSE provides these fundamental sparse operations for OpenCL (a small CPU reference of the SpM-dV operation is sketched after the list):

  • Sparse Matrix – dense Vector multiply (SpM-dV)
  • Sparse Matrix – dense Matrix multiply (SpM-dM)
  • Iterative conjugate gradient solver (CG)
  • Iterative biconjugate gradient stabilized solver (BiCGStab)
  • Dense to CSR conversions (& converse)
  • COO to CSR conversions (& converse)
  • Functions to read matrix market files in COO or CSR format
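As a concrete illustration of the first operation in the list, the short program below is a plain CPU reference of the sparse matrix – dense vector product on CSR data. It is our sketch of what SpM-dV computes, not clSPARSE code; clSPARSE performs the same computation with OpenCL kernels operating on device buffers.

```cpp
// CPU reference of SpM-dV (y = A * x) for a matrix stored in CSR format.
// Illustrative only -- not part of clSPARSE.
#include <cstddef>
#include <iostream>
#include <vector>

struct CsrMatrix {                       // compressed sparse row storage
    std::size_t rows;
    std::vector<std::size_t> rowPtr;     // size rows + 1
    std::vector<std::size_t> colIdx;     // size nnz
    std::vector<double>      values;     // size nnz
};

std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.rows, 0.0);
    for (std::size_t i = 0; i < A.rows; ++i)                        // one row per output entry
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
            y[i] += A.values[k] * x[A.colIdx[k]];
    return y;
}

int main() {
    // 3x3 example: [[2 0 1], [0 3 0], [4 0 5]]
    CsrMatrix A{3, {0, 2, 3, 5}, {0, 2, 1, 0, 2}, {2, 1, 3, 4, 5}};
    std::vector<double> x{1, 1, 1};
    for (double v : spmv(A, x)) std::cout << v << ' ';              // prints: 3 3 9
    std::cout << '\n';
}
```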

The library source code compiles cross-platform on the back of an advanced CMake build system that allows users to choose to build the library and supporting benchmarks/tests, and takes care of dependencies for them. True to the spirit of the other clMath libraries, clSPARSE exports a "C" interface to allow developers to build wrappers around clSPARSE in any language they need. The advantage of using the open source clMath libraries is that the user does not have to write or understand the OpenCL kernels; the implementation is abstracted from the user, allowing them to focus on memory placement and transport. Still, the source of the kernels is open for those who wish to understand the implementation details.

A great deal of thought and effort went into designing the APIs to make them less 'cluttered'. OpenCL state is not explicitly passed through the API, which enables the library to be forward compatible when users are ready to switch from OpenCL 1.2 to OpenCL 2.0. Lastly, we designed the APIs such that users control where input and output buffers live. You also have absolute control over when data transfers to/from device memory need to happen, so that there are no performance surprises.
You can leave general feedback about clSPARSE on this blog. For issues or specific feedback and suggestions, please use the Github issues tracker for this project.

SpeedIT FLOW 0.3 Released

We are happy to announce SpeedIT FLOW ver. 0.3.

SpeedIT FLOW is a RANS single-phase fluid flow solver that runs fully on GPU.

SpeedIT FLOW ver. 0.3

  • RANS turbulence modeling of incompressible fluids (kOmegaSST),
  • Supported boundary conditions: kqRWallFunction, omegaWallFunction, nutkWallFunction, inletOutlet, slip
  • Supported discretization scheme: upwind for div(phi)
  • Supported OpenFOAM versions: 1.7.1 – 2.3.0
GPU vs. CPU. Motorbike, 6M cells, aero flow: simpleFoam+kOmegaSST

In summary: we now solve external aero flows (motorbike) and other industry-relevant OpenFOAM cases on a GPU card approximately 3x faster than Intel Xeon E5649 CPUs running 12 cores. This is about two times faster than competing solutions that offer only partial acceleration on the GPU.

See this presentation for the latest results.

Your honest opinion on whether such a product is attractive for the market is highly appreciated. Perhaps there are still some features missing?

Contact us at info at vratis.com if you would like to test this version.

Best regards,
SpeedIT Team

Previous Releases

Release ver. 0.1

  • Full GPU acceleration of SIMPLE, PISO solvers
  • Transient and steady state flows
  • Boundary Conditions: zeroGradient, fixed value,
  • CG and BiCG linear solvers with diagonal preconditioner

Release ver. 0.2

  • AMG preconditioner for CG solver