Open Source clSPARSE Beta Released

We are happy to announce that we are open sourcing SpeedIT, our first commercial product for sparse linear algebra in OpenCL. Our most efficient kernels from SpeedIT will now be integrated into clSPARSE, an open source OpenCL™ sparse linear algebra library, created in partnership with AMD.

Read more at vratis.com/blog.

SpeedIT Flow in the UberCloud Marketplace

SpeedIT Flow has been added to UberCloud Marketplace as Experiment 180. Its performance was tested on four different aerodynamic cases: AeroCar, SolarCar, Motorbike and DrivAer. The results show that SpeedIT Flow is capable of running CFD simulations on a single GPU in times comparable to those obtained using 16-20 cores of a modern server-class CPU. Read the full case study here.

SpeedIT Flow in the UberCloud Marketplace

Together with the Ohio Supercomputer Center (OSC) and the UberCloud, we have prepared a preconfigured, tested and validated environment where SpeedIT Flow is ready to be used. In this sales model, instead of buying a license, the customer pays only for actual consumption. This model may be particularly attractive for small companies and research groups with limited budgets, and for bigger companies that want to reduce their simulation costs.

This technology may be particularly interesting for HPC centers, such as OSC, because our software offers better utilization of their resources: computations may be done on CPUs and GPUs concurrently. Moreover, the power efficiency per simulation, which is an important factor in high performance computing, is comparable between a dual-socket multicore CPU and a GPU.

Four OpenFOAM test cases were run on two different clusters at OSC: Oakley, with Intel Xeon X5650 processors, and Ruby, with Intel Xeon E5-2670 v2 processors. The results were compared to SpeedIT Flow running on Ruby with an NVIDIA Tesla K40 GPU.

The scaling results showed that SpeedIT Flow is capable of running CFD simulations on a single GPU in times comparable to those obtained using 16-20 cores of a modern server-class CPU. The electric energy consumption per simulation is also comparable to that of computations on multicore CPUs.

SpeedIT Flow gives the end user an alternative way to reduce turnaround times. By taking advantage of GPUs, which are available in most systems, simulation time can be reduced, bringing significant cost savings to the production pipeline. Finally, flexible licensing based on the number of GPUs rather than the number of CPU cores reduces software licensing costs.

With our software, resource providers such as OSC and private or public cloud providers can utilize their hardware more efficiently. On a cluster with GPUs, CFD simulations can run on CPUs and GPUs at the same time. For example, on a node with two GPUs and two ten-core CPUs, three simulations could be run: two on the GPUs and one on the eighteen otherwise unused CPU cores. As our tests show, the turnaround times and the power consumption per simulation are comparable.
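
The node-level arithmetic above can be sketched as follows. This is an illustrative helper, not part of SpeedIT Flow; it assumes each GPU-resident simulation occupies one CPU core for its host process, which is how two ten-core CPUs leave eighteen cores free.

```python
# Sketch of the node-allocation arithmetic described above (illustrative only).
# Assumption: each GPU simulation needs one CPU core for its host process.

def concurrent_simulations(gpus: int, cpus: int, cores_per_cpu: int) -> dict:
    total_cores = cpus * cores_per_cpu
    gpu_sims = gpus                        # one simulation per GPU
    free_cores = total_cores - gpu_sims    # cores left after hosting GPU runs
    cpu_sims = 1 if free_cores > 0 else 0  # one CPU simulation on the rest
    return {"gpu_sims": gpu_sims, "cpu_sims": cpu_sims, "cpu_sim_cores": free_cores}

# Node from the example: two GPUs, two ten-core CPUs.
print(concurrent_simulations(gpus=2, cpus=2, cores_per_cpu=10))
# → {'gpu_sims': 2, 'cpu_sims': 1, 'cpu_sim_cores': 18}
```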

Read the full case study here.

Figure 1: Left – visualization of the DrivAer case; Right – scaling results for the DrivAer case; the turnaround time for SpeedIT Flow is comparable to that on 16-20 cores of modern CPUs.
Figure 2: Number of simulations per day (left) and per day per Watt (right) of the DrivAer test case computed on a single node of Oakley (12 cores of Intel Xeon X5650) and Ruby (20 cores of Intel Xeon E5-2670 v2) using OpenFOAM and a single GPU (NVIDIA Tesla K40) using SpeedIT Flow.

Higher productivity of single-phase flow simulations thanks to GPU acceleration

NACA 2412 wing test case

Case setup

A flow over a simple wing with a NACA 2412 airfoil was simulated. The wing's parameters were defined as follows:

  • chord – c = 1m,
  • taper ratio – ctip/croot = 1,
  • wing span – b = 6m,
  • wing area – S = 6m².
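
As a quick consistency check of the geometry above: with a taper ratio of 1 the planform is rectangular, so the area is simply span times chord. The aspect ratio below is derived for illustration and is not stated in the original text.

```python
# Consistency check of the wing geometry listed above.
c = 1.0      # chord [m]
taper = 1.0  # ctip / croot = 1, i.e. a rectangular planform
b = 6.0      # wing span [m]

S = b * c          # planform area of a rectangular wing
AR = b ** 2 / S    # aspect ratio (derived here, not given in the text)

print(f"S = {S} m^2, AR = {AR}")
# → S = 6.0 m^2, AR = 6.0
```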

A symmetry boundary condition was used, so flow over half of the wing was simulated. The computational domain was 21m x 10m x 12m. The inlet boundary had a half-cylinder shape so that different angles of attack could be simulated without changes to the mesh. The mesh had 3,401,338 (3.4M) cells.

Figure 1. Mesh visualization
Figure 2. Zoomed part of the mesh where the wing connects to the symmetry plane

Simulations were run for angles of attack from 0 to 45 degrees in 5 degree steps. In each simulation, 500 steps of the SIMPLE algorithm were performed, using both OpenFOAM and SpeedIT Flow.

The following boundary conditions were used:

  • Fixed velocity value on the inlet and lower boundaries,
  • Zero gradient pressure on the outlet and upper boundaries,
  • Slip condition on the side boundaries.

The following numerical schemes were used:

  • Gauss linear for gradient,
  • Gauss upwind for divergence,
  • Gauss linear corrected for laplacian,
  • Linear interpolation.
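
For reference, the schemes above would typically be expressed in an OpenFOAM fvSchemes dictionary roughly as follows. This is a sketch: the `div(phi,U)` entry name is a common default, not taken from the original case files.

```
// Sketch of an fvSchemes dictionary matching the list above (illustrative).
gradSchemes
{
    default         Gauss linear;
}
divSchemes
{
    default         none;
    div(phi,U)      Gauss upwind;    // convection term, assumed entry name
}
laplacianSchemes
{
    default         Gauss linear corrected;
}
interpolationSchemes
{
    default         linear;
}
```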

Results

Results from OpenFOAM (OF) and SpeedIT Flow (SITF) were compared. Fig. 3 shows a comparison of the lift and drag coefficients for different angles of attack computed with OpenFOAM and SpeedIT Flow. The coefficients are nearly identical up to an angle of 35 degrees. The small differences at higher angles may be caused by flow separation behind the wing. Flow over the wing is shown in Fig. 4. For a 40 degree angle of attack a large recirculation zone can be seen. Fig. 5 shows the lift-to-drag ratio. The results obtained with OpenFOAM and SpeedIT Flow are nearly identical over the whole range of investigated angles of attack.
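
The lift and drag coefficients compared above are typically obtained by non-dimensionalizing the solver's integrated force output with the freestream dynamic pressure and wing area. The function below uses the standard definition; the density, velocity and force values are placeholders, not taken from the original case.

```python
# Standard force-coefficient definition: C = F / (0.5 * rho * U^2 * S).
# The freestream and force values below are placeholders for illustration.

def force_coefficient(force: float, rho: float, U: float, S: float) -> float:
    """Non-dimensionalize a force by freestream dynamic pressure and area."""
    q = 0.5 * rho * U ** 2   # dynamic pressure [Pa]
    return force / (q * S)

rho = 1.225   # air density [kg/m^3] (placeholder)
U = 10.0      # freestream velocity [m/s] (placeholder)
S = 6.0       # wing area [m^2] from the case setup

lift = 367.5  # example integrated lift force [N] (placeholder)
print(f"Cl = {force_coefficient(lift, rho, U, S):.3f}")
# → Cl = 1.000
```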

Figure 3. Lift and drag coefficients computed by OpenFOAM (OF) and SpeedIT Flow (SITF) after 500 steps of the SIMPLE algorithm for different angles of attack
Figure 4. Visualization of the pressure field on the symmetry plane and wing, and streamlines colored by velocity magnitude, for different angles of attack: left – 0 deg., center – 20 deg., right – 40 deg.
Figure 5. Lift to drag ratio calculated with OpenFOAM (OF) and SpeedIT Flow (SITF) after 500 steps with SIMPLE algorithm for different angles of attack


Acceleration

For the solution of the NACA 2412 wing case we used the following hardware:

  • CPU: 2x Intel(R) Xeon(R) CPU E5649 @ 2.53GHz (24 threads),
  • GPU: NVIDIA Quadro K6000 12GB RAM,
  • RAM: 96GB,
  • OS: Ubuntu 12.04.4 LTS 64-bit.

Time to solution and acceleration for each angle of attack are given in Fig. 6. The total simulation times were:

  • OF: 26965 s,
  • SITF: 7722 s.

which gives an overall acceleration of about 3.5x.
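
Reproducing the acceleration figure from the totals above:

```python
# Overall speedup from the total simulation times quoted above.
t_openfoam = 26965.0   # total OpenFOAM time over all angles of attack [s]
t_speedit = 7722.0     # total SpeedIT Flow time over all angles of attack [s]

speedup = t_openfoam / t_speedit
print(f"{speedup:.2f}x")
# → 3.49x, i.e. roughly the 3.5x quoted in the text
```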

Figure 6. Comparisons of time to solution for calculations done with OpenFOAM (OF) and SpeedIT Flow (SITF)

Validation

A comparison of the numerical results obtained with OpenFOAM and SpeedIT Flow is shown in Fig. 7. For lower angles of attack there is good agreement between the results. For the 40 degree angle of attack there are some differences caused by flow separation.

Figure 7. Comparison of numerical results obtained with OpenFOAM (line) and SpeedIT Flow (dots). Upper row – results on a vertical line 1m behind the wing, lower row – pressure distribution along a wing section, for different angles of attack: left – 0 deg., center – 20 deg., right – 40 deg.

Open Source clSPARSE Beta Released

We are happy to announce that we are open sourcing SpeedIT, our first commercial product for sparse linear algebra in OpenCL. We believe this decision will be beneficial for our customers, academia, the community and the HPC market.

Our most efficient kernels from SpeedIT will now be integrated into clSPARSE, an open source OpenCL™ sparse linear algebra library, created in partnership with AMD.

clSPARSE is the fourth library addition to clMathLibraries. It expands upon the existing clBLAS (Basic Linear Algebra Subprograms), clFFT (Fast Fourier Transform) and clRNG (random number generation) offerings.

The source is released under the Apache license as a Beta. We release it to the public to receive feedback, comments and constructive criticism, all of which may be filed as GitHub issues in the repository’s ticketing system. All of our current issues are open for the public to view and comment on. As a Beta release, we reserve the right to tinker with the API and make changes, depending on the constructive feedback we receive.

At the first release, clSPARSE provides these fundamental sparse operations for OpenCL:

  • Sparse Matrix – dense Vector multiply (SpM-dV)
  • Sparse Matrix – dense Matrix multiply (SpM-dM)
  • Iterative conjugate gradient solver (CG)
  • Iterative biconjugate gradient stabilized solver (BiCGStab)
  • Dense to CSR conversions (& converse)
  • COO to CSR conversions (& converse)
  • Functions to read matrix market files in COO or CSR format
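
To make the list above concrete, here is a plain-Python reference for two of those operations: a COO-to-CSR conversion and a CSR sparse-matrix/dense-vector multiply (SpM-dV). This only illustrates what the library's kernels compute on the device; it is not the clSPARSE API.

```python
# Plain-Python reference for COO→CSR conversion and CSR SpM-dV (y = A @ x).
# Illustrative only — clSPARSE performs these operations in OpenCL kernels.

def coo_to_csr(rows, cols, vals, n_rows):
    """Convert COO triplets (assumed sorted by row) to CSR arrays."""
    row_ptr = [0] * (n_rows + 1)
    for r in rows:
        row_ptr[r + 1] += 1          # count nonzeros per row
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]  # prefix sum gives row offsets
    return row_ptr, cols, vals

def csr_spmv(row_ptr, col_idx, vals, x):
    """Multiply a CSR matrix by a dense vector."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[j] * x[col_idx[j]]
        y.append(acc)
    return y

# 2x2 example: A = [[1, 2], [0, 3]]
row_ptr, col_idx, vals = coo_to_csr([0, 0, 1], [0, 1, 1], [1.0, 2.0, 3.0], 2)
print(csr_spmv(row_ptr, col_idx, vals, [1.0, 1.0]))
# → [3.0, 3.0]
```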

The library source code compiles cross-platform on the back of an advanced CMake build system that allows users to choose whether to build the library and the supporting benchmarks/tests, and takes care of dependencies for them. True to the spirit of the other clMath libraries, clSPARSE exports a “C” interface to allow developers to build wrappers around clSPARSE in any language they need. The advantage of using the open source clMath libraries is that the user does not have to write or understand the OpenCL kernels; the implementation is abstracted from the user, allowing them to focus on memory placement and transport. Still, the source of the kernels is open for those who wish to understand the implementation details.

A great deal of thought and effort went into designing the APIs to make them less ‘cluttered’. OpenCL state is not explicitly passed through the API, which enables the library to be forward compatible when users are ready to switch from OpenCL 1.2 to OpenCL 2.0. Lastly, we designed the APIs so that users control where input and output buffers live. You also have absolute control over when data transfers to/from device memory happen, so that there are no performance surprises.

You can leave general feedback about clSPARSE on this blog. For issues or specific feedback and suggestions, please use the GitHub issue tracker for this project.