SpeedIT Flow and IBM Power 8

1. Introduction

Today CFD simulations are becoming more and more computationally demanding. In many areas of science and industry there is a need to guarantee short turnaround times and fast time-to-market. Such goals can be fulfilled only with huge investments in hardware and software licenses.
Graphics Processing Units provide completely new possibilities for significant cost savings because simulation time can be reduced on hardware that is often less expensive than server-class CPUs. Almost every PC contains a graphics card that supports either CUDA or OpenCL.
SpeedIT Flow is one of the fastest Computational Fluid Dynamics (CFD) implicit single-phase flow solvers currently available. In contrast to other solutions, both the Semi-Implicit Method for Pressure-Linked Equations (SIMPLE) and the Pressure Implicit with Operator Splitting (PISO) algorithms have been implemented entirely on the Graphics Processing Unit (GPU). The software is particularly useful for accelerating time-consuming external and internal flow simulations, such as aerodynamics studies in the automotive and aviation industries.
SpeedIT Flow's robust solver technology provides double-precision accuracy on fully unstructured meshes of up to 11 million cells. Our implementation was validated on standard steady and unsteady industry-relevant problems with a RANS turbulence model. Its computational efficiency has been evaluated against modern CPU configurations using OpenFOAM.
This technology may be particularly interesting for HPC centres, because our software offers better utilization of their resources. Computations may be done on the CPUs and GPUs concurrently. If there are multiple GPUs in the system, independent computing tasks, such as parameter-sweep studies, can be solved simultaneously. When cases are solved on the GPU, the CPU resources are free and can be used for other tasks such as pre- and post-processing. Moreover, the power efficiency per simulation, which is an important factor in high performance computing, is comparable for a dual-socket multicore CPU and a GPU.
Innovation also comes from other CPU manufacturers, for example IBM with their newest hardware. POWER8 is a family of superscalar symmetric multiprocessors based on the Power Architecture. POWER8 is designed to be a massively multithreaded chip, with each of its cores capable of handling eight hardware threads simultaneously, for a total of 96 threads executed simultaneously on a 12-core chip. The processor makes use of very large amounts of on- and off-chip eDRAM caches, and on-chip memory controllers enable very high bandwidth to memory and system I/O.
In this article we show the performance of these two solutions on typical CFD studies and compare the results to those obtained with a supercomputer at the Ohio Supercomputer Center (OSC).

2. Hardware

Tests were done in three locations on different hardware:

1) IBM Power Acceleration and Design Center Boeblingen (PADC)

  • server: IBM Power System S824L with a POWER8 processor, 10 cores, 3.4 GHz,
  • OS: Ubuntu 14.04.

2) OSC supercomputer Ruby (full specification can be found at https://www.osc.edu/supercomputing/hpc)


Figure 1: Test cases. From top left: AeroCar and SolarCar – geometries from 4-ID Network; motorBike – geometry from OpenFOAM tutorial; DrivAer – geometry from Institute of Aerodynamics and Fluid Mechanics at TUM. All geometry providers are kindly acknowledged.

3) Vratis in-house cluster

  • GPU: NVIDIA Quadro K6000,
  • OS: Ubuntu 12.04.4.

3. Test cases

To test the performance of these systems we selected typical cases for both stationary and non-stationary flows. Simulations were done on the hardware listed in Section 2. Simulations run on CPUs used OpenFOAM. On the Ruby cluster, simulations were run on two processors (20 cores) per node. On the IBM cluster, computations were run on a single processor. POWER8 processors can run up to eight hardware threads per core; for OpenFOAM the best results are obtained using four threads per core. Since the processor used has ten cores, each case was run with forty processes. Simulations run on the GPU used SpeedIT Flow.
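As an illustration of how such a CPU run is typically set up, the number of MPI processes is controlled by the case's decomposeParDict. The following is only a minimal sketch; the decomposition method shown is an assumption for illustration, not necessarily the one used in these tests:

  // system/decomposeParDict (sketch)
  numberOfSubdomains  40;       // e.g. 4 threads per core x 10 POWER8 cores
  method              scotch;   // hierarchical or simple are common alternatives

The case is then decomposed with decomposePar and run in parallel, for example with mpirun -np 40 simpleFoam -parallel.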

3.1 Stationary flow

Four aerodynamic studies were selected:

  • aeroCar – case with 3.1M cells,
  • solarCar – case with 3.7M cells,
  • motorBike – case with 6.5M cells,
  • DrivAer – case with 10.2M cells.

Figure 2: Simulation times

All cases are shown in Figure 1. A standard OpenFOAM solver configuration was used together with the k-omega SST turbulence model. Simulations were run for a prescribed number of time steps to compare the time-to-solution for the different architectures.
The results for the stationary cases are shown in Figure 2. For these cases the SpeedIT Flow times-to-solution are comparable to computations on a single node of the Ruby cluster. The computation times for the IBM processor and for two nodes of the Ruby cluster are comparable to each other and about two times shorter.
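For readers who want to reproduce a comparable setup, the turbulence model is selected in the case's constant directory. A minimal sketch for the OpenFOAM 2.x series is shown below; the file name and keywords follow standard OpenFOAM conventions and are not copied from the actual case files used here:

  // constant/RASProperties (constant/turbulenceProperties in later versions)
  RASModel        kOmegaSST;
  turbulence      on;
  printCoeffs     on;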

3.2 Non-stationary flow

The stationary cases used a high relative tolerance, so the number of linear-solver iterations per time step was quite low. Since the acceleration offered by SpeedIT Flow grows with the number of linear-solver iterations, we prepared an additional test. As the non-stationary case, blood flow through the left coronary artery (LCA) was chosen. The geometry used is shown in Figure 3.
Again, a standard OpenFOAM solver configuration was used. In this example the inlet blood-flow velocity changes in time, as it would in a real flow. As the velocities are small enough, no turbulence model was used. Simulations were run for a prescribed number of time steps to compare the time-to-solution for the different architectures.
The results for the non-stationary case are shown in Figure 4. For this case the SpeedIT Flow times are comparable to computations on one socket of the IBM cluster and on two nodes of the Ruby cluster. Moreover, the higher the number of pressure-equation solver iterations, the higher the acceleration that can be obtained with SpeedIT Flow.
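As an illustration of how such a time-varying inlet velocity can be prescribed in an OpenFOAM-format case, a minimal sketch is shown below. The patch name, times and velocity values are purely illustrative assumptions and do not reproduce the actual waveform used in this study:

  // 0/U, excerpt of boundaryField
  inlet
  {
      type            uniformFixedValue;
      uniformValue    table
      (
          (0.0  (0.10 0 0))   // time [s]   velocity [m/s]
          (0.2  (0.45 0 0))
          (0.4  (0.20 0 0))
          (0.8  (0.10 0 0))
      );
  }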

Figure 3: Artery geometry.



Figure 4: Left – simulation times; right – acceleration of SpeedIT Flow with respect to IBM (dotted) and number of iterations of the pressure equation solver (line)

4. Summary

SpeedIT Flow is a new solver that completely changes the paradigm of running CFD calculations. It gives the end user an alternative way to reduce turnaround times. By taking advantage of GPUs, which are available in most systems, the simulation time can be reduced, bringing significant cost savings to the production pipeline. Finally, flexible licensing that depends on the number of GPUs rather than the number of CPU cores reduces the costs of software licensing.
As shown in the tests, SpeedIT Flow computation times on a single GPU are comparable to those obtained using up to forty threads on modern CPUs. The advantage can be even greater when the number of linear-solver iterations is higher, for example for larger non-stationary simulation cases.
SpeedIT Flow is an attractive solution for both individual users and HPC centers. It can also be an alternative to new hardware investments. As computations on GPUs are comparable to those on modern CPUs, instead of buying a new CPU-based system the end user could simply equip an existing one with GPUs. This solution should be cheaper and as effective as new hardware.
With our software, resource providers such as OSC and private or public cloud providers can utilize their hardware more efficiently. On a cluster equipped with GPUs, CFD simulations could be run on the CPUs and the GPUs at the same time. For example, on a Ruby node with two GPUs and two ten-core CPUs, three simulations could be run: two on the GPUs and one on the eighteen otherwise unused cores. As shown in our tests, the turnaround times are comparable. For the IBM cluster the best setup would be slightly different: two simulations could be run with 39 threads on each CPU, and two simulations on GPUs served by the remaining CPU threads.

Acknowledgements


We would like to thank IBM Germany for providing access to the cluster at the IBM Power Acceleration and Design Center Boeblingen, as well as for consulting on the results. Performance tests on the Ruby cluster were done during the Vratis-OSC-UberCloud experiment. Details of this experiment can be found here.

 


This offering is not approved or endorsed by OpenCFD Limited, the producer of the OpenFOAM software and owner of the OPENFOAM® and OpenCFD® trademarks.


SpeedIT Flow in the UberCloud Marketplace

Together with the Ohio Supercomputer Center (OSC) and the UberCloud we have prepared a preconfigured, tested and validated environment where SpeedIT Flow is ready to be used. In this sales model the customer, instead of buying a license, pays only for actual consumption. This model may be particularly attractive for small companies and research groups with limited budgets, and for bigger companies that want to reduce simulation costs.

This technology may be particularly interesting for HPC centers, such as OSC, because our software offers better utilization of their resources. The computations may be done on the CPUs and GPUs concurrently. Moreover, the power efficiency per simulation, which is an important factor in high performance computing, is comparable for a dual-socket multicore CPU and a GPU.

Four OpenFOAM test cases were run on two different clusters in OSC, Oakley with Intel Xeon X5650 processors and Ruby with Intel Xeon E5-2670 v2 processors. Results were compared to SpeedIT Flow which was run on the Ruby using NVIDIA Tesla K40 GPU.

The scaling results showed that SpeedIT Flow is capable of running CFD simulations on a single GPU in times comparable to those obtained using 16-20 cores of a modern server-class CPU. Also, the electric energy consumption per simulation is comparable to that of computations on multicore CPUs.

SpeedIT Flow gives the end user an alternative way to reduce turnaround times. By taking advantage of GPUs, which are available in most systems, the simulation time can be reduced, bringing significant cost savings to the production pipeline. Finally, flexible licensing that depends on the number of GPUs rather than the number of CPU cores reduces the costs of software licensing.

With our software, resource providers such as OSC and private or public cloud providers can utilize their hardware more efficiently. On a cluster with GPUs, CFD simulations can be run on the CPUs and the GPUs at the same time. For example, on a node with two GPUs and two ten-core CPUs, three simulations could be run: two on the GPUs and one on the eighteen otherwise unused cores. As shown in our tests, the turnaround times and the power consumption per simulation are comparable.

Read the full case study here.

Figure 1: Left – visualization of the DrivAer case; Right – scaling results for the DrivAer case, turnaround time for SpeedIT Flow is comparable to those on 16-20 cores of modern CPUs.


Figure 2: Number of simulations per day (left) and per day per Watt (right) of the DrivAer test case computed on a single node of Oakley (12 cores of Intel Xeon X5650) and Ruby (20 cores of Intel Xeon E5-2670 v2) using OpenFOAM and a single GPU (NVIDIA Tesla K40) using SpeedIT Flow.



Higher productivity of single-phase flow simulations thanks to GPU acceleration

NACA 2412 wing test case

Case setup

A flow over a simple wing with a NACA 2412 airfoil was simulated. The wing's parameters were defined as follows (a quick consistency check of these values is given after the list):

  • chord – c = 1 m,
  • taper ratio – ctip/croot = 1,
  • wing span – b = 6 m,
  • wing area – S = 6 m².
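These values are mutually consistent: with a taper ratio of 1 the planform is rectangular, so

\[ S = b \cdot c = 6\,\mathrm{m} \times 1\,\mathrm{m} = 6\,\mathrm{m}^2, \qquad AR = \frac{b^2}{S} = 6. \]

Because of the symmetry plane described below, only half of this span and area is actually meshed.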

A symmetry boundary condition was used, so flow over half of the wing was simulated. The computational domain was 21 m x 10 m x 12 m. The inlet boundary had a half-cylinder shape so that different angles of attack could be simulated without changes to the mesh. The mesh had 3,401,338 (3.4M) cells.

Figure 1. Mesh visualization


Figure 2. Zoomed part of the mesh where the wing connects to the symmetry plane


Simulations were run for different angles of attack from 0 to 45 degrees with a 5-degree step. In each simulation, 500 steps of the SIMPLE algorithm were performed in both OpenFOAM and SpeedIT FLOW.

The following boundary conditions were used (a sketch of the corresponding field entries is given after the list):

  • Fixed velocity value on the inlet and lower boundaries,
  • Zero gradient pressure on the outlet and upper boundaries,
  • Slip condition on the side boundaries.
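A minimal sketch of how these conditions might appear in the velocity and pressure field files is given below. The patch names and the inlet velocity magnitude are assumptions made for illustration; they are not taken from the actual case files:

  // 0/U (excerpt of boundaryField)
  inlet   { type fixedValue;  value uniform (30 0 0); }
  lower   { type fixedValue;  value uniform (30 0 0); }
  sides   { type slip; }

  // 0/p (excerpt of boundaryField)
  outlet  { type zeroGradient; }
  upper   { type zeroGradient; }
  sides   { type slip; }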

The following numerical schemes were used (see the fvSchemes sketch after the list):

  • Gauss linear for gradient,
  • Gauss upwind for divergence,
  • Gauss linear corrected for laplacian,
  • Linear interpolation.
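For reference, a minimal fvSchemes dictionary matching the list above could look as follows. This is only a sketch using standard OpenFOAM keywords; it is not copied from the actual case files:

  gradSchemes          { default Gauss linear; }
  divSchemes           { default none; div(phi,U) Gauss upwind; }
  laplacianSchemes     { default Gauss linear corrected; }
  interpolationSchemes { default linear; }
  snGradSchemes        { default corrected; }   // typically also required, although not listed above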

Results

Results from OpenFOAM (OF) and SpeedIT Flow (SITF) were compared. Fig. 3 shows a comparison of the lift and drag coefficients for different angles of attack computed with OpenFOAM and SpeedIT Flow. The coefficients are nearly identical up to an angle of 35 degrees. The small differences for higher angles may be caused by flow separation behind the wing. The flow over the wing is shown in Fig. 4; for a 40-degree angle of attack a large recirculation zone can be seen. Fig. 5 shows the lift-to-drag ratio. The results obtained with OpenFOAM and SpeedIT Flow are nearly identical over the whole range of investigated angles of attack.

Figure 3. Lift and drag coefficients computed by OpenFOAM (OF) and SpeedIT Flow (SITF) after 500 steps of the SIMPLE algorithm for different angles of attack


Figure 4. Visualization of pressure field on the plane of symmetry and wing and streamlines colored with velocity magnitude with different angle of attack: left – 0 deg., center – 20 deg., right – 40 deg.

Figure 5. Lift to drag ratio calculated with OpenFOAM (OF) and SpeedIT Flow (SITF) after 500 steps with SIMPLE algorithm for different angles of attack



Acceleration

For the solution of the NACA 2412 wing case we used the following hardware:

  • CPU: 2x Intel(R) Xeon(R) CPU E5649 @ 2.53 GHz (24 threads),
  • GPU: NVIDIA Quadro K6000, 12 GB RAM,
  • RAM: 96 GB,
  • OS: Ubuntu 12.04.4 LTS 64-bit.

Times to solution and the acceleration for each angle of attack are given in Fig. 6. The total simulation times are:

  • OF: 26965 s,
  • SITF: 7722 s.

which gives an acceleration of 3.5x.

Figure 6. Comparisons of time to solution for calculations done with OpenFOAM (OF) and SpeedIT Flow (SITF)


Validation

A comparison of the numerical results obtained with OpenFOAM and SpeedIT Flow is shown in Fig. 7. For lower angles of attack there is good agreement between the results. For the 40-degree angle of attack there are some differences caused by flow separation.

Figure 7. Comparison of numerical results obtained with OpenFOAM (line) and SpeedIT Flow (dots). Upper row – results on a vertical line 1 m behind the wing; lower row – pressure distribution along the wing section, for different angles of attack: left – 0 deg., center – 20 deg., right – 40 deg.


Open Source clSPARSE Beta Released

We are happy to announce that we are open sourcing SpeedIT, our first commercial product for sparse linear algebra in OpenCL. We believe this decision will be beneficial for our customers, academia, the community and the HPC market.

Our most efficient kernels from SpeedIT will now be integrated into clSPARSE, an open source OpenCL™ sparse linear algebra library, created in partnership with AMD.

clSPARSE is the fourth library addition to clMathLibraries. It expands upon the available dense clBLAS (Basic Linear Algebra Subprograms), clFFT (Fast Fourier Transform) and clRNG (random number generator) offerings already available.

The source is released under the Apache license as a Beta release. We release this to the public to receive feedback, comments and constructive criticism, which may all be filed as Github issues in the repository’s ticketing system. All of our current issues are open to the public to view and comment on. As a Beta release, we reserve the right to tinker with the API and make changes, all depending on constructive feedback we receive.

In its first release, clSPARSE provides these fundamental sparse operations for OpenCL:

  • Sparse Matrix – dense Vector multiply (SpM-dV)
  • Sparse Matrix – dense Matrix multiply (SpM-dM)
  • Iterative conjugate gradient solver (CG)
  • Iterative biconjugate gradient stabilized solver (BiCGStab)
  • Dense to CSR conversions (& converse)
  • COO to CSR conversions (& converse)
  • Functions to read matrix market files in COO or CSR format

The library source code compiles cross-platform on the back of an advanced cmake build system that allows users to choose to build the library and supporting benchmarks/tests, and takes care of dependencies for them. True to the spirit of the other clMath libraries, clSPARSE exports a "C" interface to allow developers to build wrappers around clSPARSE in any language they need. The advantage of using the open source clMath libraries is that the user does not have to write or understand the OpenCL kernels; the implementation is abstracted from the user, allowing them to focus on memory placement and transport. Still, the source of the kernels is open for those who wish to understand the implementation details.

A great deal of thought and effort went into designing the APIs to make them less 'cluttered'. OpenCL state is not explicitly passed through the API, which enables the library to be forward compatible when users are ready to switch from OpenCL 1.2 to OpenCL 2.0. Lastly, we designed the APIs such that users control where input and output buffers live. You also have absolute control over when data transfers to/from device memory happen, so that there are no performance surprises.
You can leave general feedback about clSPARSE on this blog. For issues or specific feedback and suggestions, please use the Github issues tracker for this project.


SpeedIT FLOW 0.3 Released

We are happy to announce SpeedIT FLOW ver. 0.3.

SpeedIT FLOW is a RANS single-phase fluid flow solver that runs fully on GPU.

SpeedIT FLOW ver. 0.3

  • RANS turbulence modeling of incompressible fluids (kOmegaSST),
  • Supported boundary conditions: kqRWallFunction, omegaWallFunction, nutkWallFunction, inletOutlet, slip (see the sketch after this list),
  • Supported discretization scheme: upwind for div(phi),
  • Supported OpenFOAM versions: 1.7.1 – 2.3.0.
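To give a concrete picture, the wall-function boundary conditions listed above typically appear in the turbulence field files of a case as in the sketch below. The patch name and the initial values are illustrative assumptions, not taken from the motorbike case:

  // 0/k (excerpt)
  body   { type kqRWallFunction;   value uniform 0.24; }

  // 0/omega (excerpt)
  body   { type omegaWallFunction; value uniform 1.78; }

  // 0/nut (excerpt)
  body   { type nutkWallFunction;  value uniform 0; }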
GPU vs. CPU. Motorbike, 6M cells, aero flow: simpleFoam+kOmegaSST


In summary: we now solve external aero flows (motorbike) and other industry-relevant OpenFOAM cases on a GPU card about 3x faster than an Intel Xeon E5649 running 12 cores. This is about two times faster than competing solutions that offer only partial acceleration on the GPU.

See this presentation for the latest results.

Your honest opinion on whether such a product is attractive for the market would be highly appreciated. Maybe there are still some features missing?

Contact us at info at vratis.com if you would like to test this version.

Best regards,
SpeedIT Team

Previous Releases

Release ver. 0.1

  • Full GPU acceleration of SIMPLE, PISO solvers
  • Transient and steady state flows
  • Boundary Conditions: zeroGradient, fixed value,
  • CG and BiCG linear solvers with diagonal preconditioner

Release ver. 0.2

  • AMG preconditioner for CG solver

SpeedIT FLOW Benchmark Test

This presentation shows our recent benchmark test where we compare SpeedIT FLOW running on a single Tesla M2050 GPU card vs. OpenFOAM running on 12 CPU threads (Intel Xeon E5649).


SpeedIT 2.4 vs. OpenFOAM

Introduction

SpeedIT 2.4 is the next version of our leading software for accelerating CFD on GPUs. The results show that SpeedIT is a good choice for users with desktop computers who want to accelerate OpenFOAM on their machines. Users with server-class CPUs should follow the development of SpeedIT FLOW.

SpeedIT 2.4 Features:

  • OpenCL version of Conjugate Gradient and BiConjugate Gradient solvers together with a diagonal preconditioner,
  • OpenCL version of Sparse Matrix-Vector Multiplication.

Performance

The performance has been tested on three cases: an external flow simulation over a simplified car model (the Ahmed body) with 1.37M cells, and blood-flow simulations through the basilar and carotid arteries.


Fig. Acceleration of OpenFOAM on GPU using SpeedIT. On CPUs OpenFOAM was run with 4 MPI threads and GAMG.


Conclusions

SpeedIT successfully accelerates realistic simulations run on desktop machines to a satisfactory extent. However, for cases where the number of iterations of the iterative solvers is small, accelerating them on the GPU does not bring a high speedup. Server-class CPUs are still beyond the reach of SpeedIT. The alternative approach, where the solvers run fully on the GPU, is much more effective (see SpeedIT FLOW).


How to run SpeedIT with OpenFOAM?

Introduction

The SpeedIT plugin for OpenFOAM is a set of libraries which allows you to accelerate OpenFOAM on a GPU. SpeedIT releases the computational power dormant in NVIDIA Graphics Processing Units (GPUs) that support CUDA technology. The SpeedIT library provides a set of accelerated solvers and functions for sparse linear systems of equations, which are:

  • Preconditioned Conjugate Gradient
  • Preconditioned Stabilized Bi-Conjugate Gradient
  • Accelerated Sparse Matrix-Vector Multiplication
  • Diagonal Preconditioner
  • Algebraic Multigrid (AMG) based on Smoothed Aggregation
  • Approximate Inverse (AINV)

Requirements

Software dependencies

OpenFOAM

OpenFOAM is the environment in which the SpeedIT plugin operates. OpenFOAM can be downloaded from http://www.openfoam.com/download/. Install OpenFOAM by following the instructions on the OpenFOAM page.

IMPORTANT: Make sure you have completed the step that sets the OpenFOAM environment variables.

CUDA

Download the CUDA toolkit from http://developer.nvidia.com/cuda-downloads and install it. Add the CUDA include directory to your PATH variable:

PATH=$PATH:/path/to/cuda/include

Depending on your system (32/64-bit), add the CUDA lib or lib64 directory to LD_LIBRARY_PATH, i.e.:

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/cuda/lib:/path/to/cuda/lib64

SpeedIT

The OpenFOAM plugin requires SpeedIT to work. SpeedIT is available commercially, and SpeedIT Classic with limited functionality can be downloaded at no cost. SpeedIT can be downloaded from http://speedit.vratis.com

Cuwrap

Cuwrap is an intermediate library which provides compatibility between the CUDA and OpenFOAM interfaces. It is distributed with the OpenFOAM plugin. You can find it in the folder cuwrap. It is necessary to build this library if you want to use SpeedIT with OpenFOAM.

To build this library, first open the Makefile in the cuwrap folder.

Depending on your configuration, set the proper paths to the CUDA environment.
For a 32-bit system and a default CUDA installation the header of the file should look as follows:

CUDA_HOME="/usr/local/cuda"
CUDA_LIB="$CUDA_HOME/lib"
NVIDIA_CURRENT="/usr/lib"
CUDA_INC="$CUDA_HOME/include"
CUDA_BIN="$CUDA_HOME/bin"

For 64-bit systems:

CUDA_HOME="/usr/local/cuda"
CUDA_LIB="$CUDA_HOME/lib64"
NVIDIA_CURRENT="/usr/lib64"
CUDA_INC="$CUDA_HOME/include"
CUDA_BIN="$CUDA_HOME/bin"

After setting the paths, run the make command from the cuwrap folder. It will build the library. If OpenFOAM is configured properly, the library should be created inside the $FOAM_USER_LIBBIN folder.

Plugin Installation

  1. Create the directory $HOME/OpenFOAM.
  2. Create additional directories by typing:
     mkdir $WM_PROJECT_USER_DIR && mkdir $FOAM_RUN
  3. From the plugin directory run ./Allwmake COMMERCIAL
  4. If compilation completes successfully you should have the new files libexternalsolv.so and libcuwrap.so in the $FOAM_USER_LIBBIN directory.

Plugin use

  1. Copy (or make symbolic links to) the following libraries into the $FOAM_USER_LIBBIN directory:

  • libcublas.so

  • libcudart.so

  • libcuwrap.so

  • libspeedit.so

libcublas.so and libcudart.so come from the NVIDIA CUDA toolkit.

libcuwrap.so is distributed with the plugin.

NOTE: Remember to use the proper version of the libraries depending on your system architecture: 32-bit libraries on 32-bit operating systems and 64-bit libraries on 64-bit operating systems.

    1. Go into the directory with your OpenFOAM case, e.g. $FOAM_RUN/tutorials/incompressible/icoFoam/cavity
    2. Append

      libs
          (
            "libcudart.so"
            "libcublas.so"
            "libcuwrap.so"    
            "libspeedit.so”
            "libexternalsolv.so"
      );

      to the end of your system/controlDict file for every FOAM case for which you want to use the external, accelerated solvers.

  • In the file system/fvSolution change the solver names for the solvers for which you are going to enable acceleration. Remember to use the proper names for the accelerated solvers. You may replace:

    PBiCG with SI_PBiCG
    PCG   with SI_PCG
  • For accelerated solvers choose an appropriate preconditioner in the file system/fvSolution. You may use the following preconditioners:

    1. SI_DIAGONAL – Diagonal preconditioner

    2. SI_AMG – Algebraic Multigrid preconditioner

    3. SI_AINV – Approximate Inverse preconditioner

    4. SI_AINV_SC – Approximate Inverse Scaled preconditioner

    5. SI_AINV_NS – Approximate Inverse Non-Symmetric preconditioner

  • When accelerated solvers are used you have to specify the additional keyword "matrix" in the solver definition. It can take two values, CMR or CSR, which stand for:

    1. CSR – Compressed Sparse Row format.

    2. CMR – Compressed Multi-Row Storage format (see our article for details)

When CSR is used, all preconditioners listed above are allowed. When the CMR format is used, only SI_DIAGONAL works at the moment.

Run icoFoam from $FOAM_RUN/tutorials/incompressible/icoFoam/cavity.

The accelerated solvers should be available from now on.

Example of fvSolution:


/*--------------------------------*- C++ -*----------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.7.1                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    location    "system";
    object      fvSolution;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

solvers
{
    p
    {
        solver          SI_PCG;
        preconditioner  SI_AMG;
        matrix          CSR;
        tolerance       1e-06;
        relTol          0;
    }

    U    // velocity equation; field name assumed for the second solver entry
    {
        solver          SI_PBiCG;
        preconditioner  SI_DIAGONAL;
        matrix          CSR;
        tolerance       1e-05;
        relTol          0;
    }
}

PISO
{
    nCorrectors              2;
    nNonOrthogonalCorrectors 0;
    pRefCell                 0;
    pRefValue                0;
}

// ************************************************************************* //


SpeedIT FLOW accelerates OpenFOAM

Introduction

Our recent findings indicate that SpeedIT alone cannot accelerate OpenFOAM (and probably other CFD codes) to a satisfactory extent. If you follow our recent reports you will see that SpeedIT is attractive for desktop computers but performs worse when compared to server-class CPUs, such as Intel Xeon. The reason for such mild acceleration is Amdahl's Law, which states that the acceleration is bounded by the fraction of the code that cannot be parallelized. Since in non-linear Navier-Stokes solvers only a fragment of the algorithm is accelerated by the iterative solvers run on the GPU, the acceleration is limited. The only reasonable solution is to implement the whole algorithm on the GPU card.
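For reference, Amdahl's Law can be written as follows, where p is the fraction of the runtime that can be parallelized (here, the linear solvers) and s is the speedup of that fraction:

\[ S_{\mathrm{overall}} = \frac{1}{(1-p) + p/s} \;\le\; \frac{1}{1-p} \]

so even an arbitrarily fast GPU solver cannot push the overall speedup beyond 1/(1-p).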

SpeedIT FLOW is a newly developed solver for incompressible transient and steady-state laminar flows that supports 3D unstructured grids and the OpenFOAM format. It has been fully implemented on the GPU using CUDA. The solver currently implements the PISO and SIMPLE algorithms and a selection of boundary conditions, and was thoroughly tested on a number of OpenFOAM cases. The maximal acceleration is up to 3.5x when compared to OpenFOAM run on a multi-core Intel Xeon with 12 MPI threads and the fastest possible multigrid solver (GAMG). To our knowledge SpeedIT FLOW is the fastest accelerator of OpenFOAM publicly available.

Methodology

The methodology of our approach takes advantage of the structure of the tetrahedral mesh itself as well as neighborhood properties. A special data format has been designed to access the memory efficiently in a coalesced way. More information about the prototype of the technology is available in Int. J. Comp. Fluid Dynamics.

Performance

Fig. 1 presents the acceleration of SpeedIT FLOW on a single NVIDIA Tesla vs. OpenFOAM run on an Intel Xeon with 12 cores. Tab. 1 presents the duration of selected test cases: lid-driven cavity flow with a varying number of cells, Poiseuille flow and blood flow through coronary arteries.


Fig. 1. Acceleration of SpeedIT FLOW run on NVIDIA Tesla C2050 vs. OpenFOAM run on Intel Xeon with 12 cores.

                   PISO                                              SIMPLE
Case             | SpeedIT FLOW   SpeedIT FLOW   Intel Xeon       | SpeedIT FLOW   SpeedIT FLOW   Intel Xeon
                 | diagonal+CG    AMG+CG         GAMG             | diagonal+CG    AMG+CG         GAMG
cavity3D, 10K    | 190.0          172.0          33.3             | 13.9           13.5           0.8
cavity3D, 100K   | 655.0          445.0          379.7            | 123.0          81             69.1
cavity3D, 1M     | 4026.0         1542.0         5093.3           | 2062.0         821            2773.2
coronary_artery  | 3436.0         1077.0         1114             | 348.0          140            158.4
Poiseuille Flow  | 55877.0        1776.0         4182.4           | -              -              -

Table 1. Duration of the simulations in seconds. SpeedIT FLOW run on NVIDIA Tesla C2050 vs. OpenFOAM run on Intel Xeon with 12 cores.

Validation

In order to validate the solver numerically, the results were compared with the results from the same tests run in OpenFOAM with both SIMPLE and PISO solvers.

Tab. 2 and Tab. 3 present the norm between the OpenFOAM and SpeedIT Flow results. The norm is defined as the maximal absolute difference between the pressure (respectively velocity) fields over all cells in both cases.
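In other words, for the pressure field the reported norm is (and analogously for the velocity magnitude):

\[ \|p\|_{\infty} = \max_{i \in \mathrm{cells}} \left| \, p_i^{\mathrm{OF}} - p_i^{\mathrm{SITF}} \, \right| \]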

Case t [sec] p norm U norm
cavity3D, 1MLN 0.32 7.64e-06 4.73e-06
cavity3D, 100K 0.32 2.33e-07 4.48e-07
cavity3D, 1000 0.32 1.39e-08 1.30e-09
coronary_artery 0.2 1.17e-06 2.60e-05
Poiseuille Flow 0.5 2.46e-08 1.80e-07

Table 2: Largest absolute difference in velocity magnitude and pressure between SpeedIT FLOW and OpenFOAM for time-dependent flows (PISO).

Case t [sec] p norm U norm
cavity3D, 1MLN 0.32 1.97e-04 8.41e-04
cavity3D, 100K 0.32 4.23e-05 1.78e-04
cavity3D, 1000 0.32 2.78e-06 9.48e-06
coronary_artery 0.2 3.09e-06 7.25e-05

Table 3: Largest absolute difference in velocity magnitude and pressure between SpeedIT FLOW and OpenFOAM for stationary flows (SIMPLE).

The next two figures present line plots from both OpenFOAM and SpeedIT FLOW for the 3D cavity and Poiseuille flow, run with SIMPLE and PISO respectively.

 

Geometry of the 3D cavity case.

 

Poiseuille flow.

Conclusions

SpeedIT FLOW is a 3D solver for incompressible, laminar, transient and steady-state flows fully implemented on the GPU. The results clearly show that the achieved acceleration depends strongly on the size of the case and the number of iterations per time step. For the cavity3D case the performance is reasonable when the mesh has about a million cells. Also in the case of time-dependent flows the acceleration is acceptable.

Unfortunately, because the AMG implementation used requires a lot of memory, the largest case that fits into GPU memory is about 4.74 million cells. Therefore, the next goals are to add multi-GPU functionality and a more efficient AMG implementation.

SpeedIT FLOW Features

  • Unstructured 3D Mesh Support
  • Incompressible, laminar transient and steady-state flows
  • Boundary conditions: time varying inlet conditions, fixed value, groovyBC, totalPressure.
  • Supports OpenFOAM Format.

Requirements

  • Linux (x86, x86-64 and Itanium).
  • NVIDIA GPU with compute capability 2.0.

More information : info (at) vratis.com or sales (at) vratis.com

Licensing

SpeedIT FLOW is in an alpha version. Any suggestions or remarks from interested parties will be kindly acknowledged.

None of the OPENFOAM® related products and services offered by Vratis Limited Sp. z o.o. are approved or authorized by OpenCFD Ltd. (ESI Group), owner of the OPENFOAM® and OpenCFD® trade marks and producer of the OpenFOAM software.


Acceleration of OpenFOAM with SpeedIT 2.1

Comparison to GAMG and DIC preconditioners

Vratis Ltd., Wroclaw, Poland
April 5, 2012

1. Objective

OpenFOAM® simulations take a significant amount of time, leading to higher simulation costs. GPGPU technology has the potential to overcome this problem. As a solution we propose the SpeedIT technology, which replaces iterative solvers in OpenFOAM with their GPU-accelerated versions. In the following tests we accelerate the calculation of the pressure equation, which usually takes most of the time in simulations of incompressible flows. We compare the performance of OpenFOAM & SpeedIT run on a GPU to standard OpenFOAM on a CPU using various preconditioners, on a typical PC equipped with a CUDA-compatible NVIDIA GPU card. This report also presents the new version, SpeedIT 2.1, which contains a new set of preconditioners.

2. Methodology

SpeedIT is a library which implements a set of accelerated solvers with various preconditioners. Thanks to the CUSP library, in SpeedIT 2.1 we were able to utilize an algebraic multigrid preconditioner with smoothed aggregation (AMG). This preconditioner significantly reduces the number of iterations during the pressure calculation, which implies shorter calculation times. The SpeedIT Plugin to OpenFOAM® was used to substitute OpenFOAM's iterative solvers with the ones provided by SpeedIT. Tests were performed on the following machines:

A) CPU: Intel Core 2 Duo E8400, 3GHz, 8GB RAM @ 800MHz
GPU: Nvidia GTX 460, VRAM 1GB
Software: Ubuntu 11.04 x64, OpenFOAM 2.0.1, CUDA Toolkit 4.1
B) CPU: Intel Q8400, 2.66 GHz, 8GB RAM @ 800MHz
GPU: Nvidia Tesla C2070, VRAM: 6GB.
Software: Ubuntu 11.04 x64, OpenFOAM 2.0.1, CUDA Toolkit 4.1

To solve the pressure equation with OpenFOAM on the CPU, either the GAMG solver or CG with the DIC preconditioner was used for different numbers of cores. On the GPU, SpeedIT was run together with the AMG preconditioner. We tested the following cases for a fixed number of time steps.

  1. Cavity 3D, 512K cells, icoFoam, on 1 and 2 cores with the PCG solver and DIC preconditioner, the GAMG solver, the FDIC preconditioner, a Gauss-Seidel smoother, and SpeedIT 2.1 with the AMG preconditioner.

Picture 1. Cavity 3D, velocity streamlines

  2. Aorta, 200K cells, simpleFoam, on 1 and 2 cores with the PCG solver and DIC preconditioner, the GAMG solver, the FDIC preconditioner, a Gauss-Seidel smoother, and SpeedIT 2.1 with the AMG preconditioner.

Picture 2. Aorta, velocity streamlines

  3. Ahmed case with 2.5M cells simulated with the original simpleFoam, on 1, 2, 3 and 4 cores with the GAMG solver and Gauss-Seidel smoother, and SpeedIT with the AMG preconditioner.

Picture 3. Ahmed 25º, velocity streamlines and pressure field.

Cases 1 and 2 were executed on machine A, and case 3 on machine B.

3. Validation

To validate our solution we plotted the pressure field along the x axis for cases 1 and 2. From Figs. 1-3 it is quite clear that the solutions are correct for simulations with different preconditioners.

Figure 1. Cavity 3D cross section along x axis. Solution for all preconditioners.
Figure 2. Aorta cross section along x axis. Solution for all preconditioners.

Figure 3. Aorta cross section along x axis. Solution for all preconditioners.

4. Results
Cavity 3D

Figure 4. Execution time of the Cavity 3D case for different preconditioners and numbers of cores.

Figure 5. Mean number of iterations for the GAMG, AMG and DIC preconditioners during pressure calculations.

Figure 6. Acceleration defined as the ratio of SpeedIT vs. CPU with different preconditioners.

Aorta

Figure 7. Execution time of the Aorta case for different preconditioners and numbers of cores.

Figure 8. Mean number of iterations for GAMG, SpeedIT with AMG, and the DIC preconditioner during pressure calculations.

Figure 9. Acceleration defined as a ratio GPU (SpeedIT) vs. CPU with different preconditioners.

Ahmed 25º

Figure 10. Execution time for the Ahmed case on the GPU with the AMG preconditioner and on different numbers of cores with the GAMG solver.

Figs. 1-3 prove that SpeedIT leads to the same solution as OpenFOAM. SpeedIT's new AMG preconditioner can be competitive with OpenFOAM's GAMG solver running on a 1- or 2-core CPU. The main advantage of the AMG solver is that it significantly reduces the number of iterations when solving the pressure equation. Compared to the widely used DIC preconditioner, SpeedIT 2.1 needs about 10 times fewer iterations (Fig. 5 and Fig. 8), which in effect gives a speedup of up to 3.5x. Interestingly, we found that GAMG fails when calculations are performed in single precision, while AMG still functions. Fig. 11 presents the mean number of iterations for the Cavity 3D case in single precision: the GAMG solver needs as many as 1000 iterations during the pressure field calculations.

Figure 11. Mean number of iterations for Cavity 3D case in single precision.

5. Acknowledgments

We would like to thank NVIDIA for hardware support and the 4-ID network for providing the Ahmed test case. The Ahmed test case was based on the motorbike tutorial from OpenFOAM 2.0. We also acknowledge Dominik Szczerba from the IT'IS Foundation for providing the geometry of the human aorta.
