
ExaFMM: An Exascale-capable, TF/s per GPU or Xeon Phi, Long-Range Force Library for Particle Simulations

May 1, 2014 by Rob Farber

Rio Yokota has implemented exaFMM, a Fast Multipole Method (FMM) library that speeds up applications which must quickly calculate the effects of long-range forces such as gravity or magnetism on discrete particles in a simulation. Based on work he performed as a post-doc with Lorena Barba, the open-source FMM library runs on GPUs, multicore processors, and Intel Xeon Phi, and on most of the leadership-class supercomputers around the world. The P2P kernel of the exaFMM library can exceed a TF/s of performance on a single Intel Xeon Phi (Knights Corner) coprocessor or NVIDIA Kepler GPU. To my knowledge, this is the second user-written (i.e. non-MKL) code released to users that achieves this level of performance on Intel Xeon Phi. (The first is the Farber machine-learning code described here.)

(Slide: realizing in user code the TF/s performance announced by Intel for Knights Corner at Supercomputing 2011)

N-body simulation is one of the “seven dwarfs of HPC”, the common patterns the HPC community has identified as covering the most important algorithm classes in numerical simulation. Each dwarf captures a particular pattern of computation and communication that is common to widely used HPC scientific codes:

  1. Dense linear algebra
  2. Sparse linear algebra
  3. Spectral methods
  4. N-body methods
  5. Structured grids
  6. Unstructured grids
  7. Monte Carlo methods

Rio quantifies exaFMM performance in terms of the roofline model, which bounds the attainable performance of a kernel by its arithmetic intensity (floating-point operations per byte of data moved) together with the machine's peak compute rate and memory bandwidth. As you can see in the plot in the slide below, exaFMM achieves very high performance on both Intel Xeon Phi and NVIDIA Kepler accelerators. (A PDF of the slides can be found at this link.)

(Slide: The Roofline Model and the Seven Dwarfs of HPC)
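To make the roofline bound just described concrete, here is a minimal sketch with purely illustrative peak-compute and bandwidth numbers (none of these figures come from the exaFMM slides):

    #include <algorithm>
    #include <cstdio>
    #include <initializer_list>

    // Roofline bound: attainable GFLOP/s is limited either by the machine's
    // peak compute rate or by memory bandwidth times arithmetic intensity.
    double roofline_gflops(double flops_per_byte,
                           double peak_gflops, double peak_gbytes_per_s) {
        return std::min(peak_gflops, flops_per_byte * peak_gbytes_per_s);
    }

    int main() {
        const double peak_gflops = 1000.0;  // hypothetical 1 TF/s accelerator
        const double peak_bw     = 200.0;   // hypothetical 200 GB/s memory system
        for (double ai : {0.25, 1.0, 4.0, 16.0, 64.0})
            std::printf("AI = %5.2f flop/byte -> bound = %6.1f GF/s\n",
                        ai, roofline_gflops(ai, peak_gflops, peak_bw));
        return 0;
    }

Kernels with low arithmetic intensity sit under the sloped (bandwidth-limited) part of the roof, while compute-rich kernels such as exaFMM's P2P approach the flat compute ceiling.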

You can see an N-body simulation with a large number of particles in operation in this video clip from the NVIDIA GTC conference: http://youtu.be/aByz-mxOXJM?t=2m40s

Understanding computational intensity

Most computationally oriented scientists and programmers are familiar with BLAS (the Basic Linear Algebra Subprograms), the de facto programming interface for basic linear algebra. BLAS is structured in three levels with increasing data and runtime requirements (a short code sketch follows the list):

  • Level-1: Vector-vector operations that require O(N) data and O(N) work. Examples include taking the inner product of two vectors or scaling a vector by a constant multiplier.
  • Level-2: Matrix-vector operations that require O(N²) data and O(N²) work. Examples include matrix-vector multiplication or a single right-hand-side triangular solve.
  • Level-3: Matrix-matrix operations that require O(N²) data and O(N³) work. The canonical example is dense matrix-matrix multiplication.
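Here is that sketch: one representative call per BLAS level through the C interface. It assumes a CBLAS implementation such as OpenBLAS or Intel MKL is available to link against, and the sizes and values are illustrative:

    #include <cblas.h>
    #include <vector>

    int main() {
        const int n = 1024;
        std::vector<double> x(n, 1.0), y(n, 2.0);                          // vectors
        std::vector<double> A(n * n, 0.5), B(n * n, 0.25), C(n * n, 0.0);  // matrices

        // Level-1: O(N) data, O(N) work (dot product).
        double d = cblas_ddot(n, x.data(), 1, y.data(), 1);

        // Level-2: O(N^2) data, O(N^2) work (y = A*x + y).
        cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                    1.0, A.data(), n, x.data(), 1, 1.0, y.data(), 1);

        // Level-3: O(N^2) data, O(N^3) work (C = A*B).
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);

        (void)d;
        return 0;
    }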

The table below illustrates the amount of work performed by each BLAS level relative to the amount of data transferred from the host to the device. It does not take into account the time required to transfer results back to the host.

BLAS level   Data     Work     Work per Datum
1            O(N)     O(N)     O(1)
2            O(N²)    O(N²)    O(1)
3            O(N²)    O(N³)    O(N)

Table: Work per datum by BLAS level

The “Work per Datum” column tells us that Level-3 BLAS operations should run efficiently on GPUs and coprocessors because they perform O(N) work for every floating-point value transferred to the device. Thus the device spends its time calculating rather than waiting for data. The same work-per-datum analysis applies to non-BLAS computational problems such as N-body simulations. For example, the following slide shows that exaFMM spends most of its time computing (the red bars) on the Intel Xeon Phi coprocessors of the TACC Stampede supercomputer. In other words, the time spent computing dominates the runtime even when using many Intel Xeon Phi devices (up to 256 in this slide). Weak scaling describes how the solution time varies as additional processing elements are added to solve a larger overall problem while the problem size (i.e. workload) assigned to each processing element remains constant.

(Slide: exaFMM weak scaling on the Intel Xeon Phi coprocessors of TACC Stampede)

Algorithmic speedups

The code for an O(N²) direct N-body simulation is quite simple, as can be seen in the sketch below. In contrast, the FMM runs in O(N) time, roughly an N-fold algorithmic speedup, which can be quite significant for million- or billion-particle systems.
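The original post displayed the two-loop kernel as an image; what follows is a minimal sketch of such a direct O(N²) kernel, illustrative only and not exaFMM's actual implementation:

    #include <cmath>
    #include <cstddef>

    // Direct O(N^2) evaluation of gravitational-style accelerations.
    // Every target particle i sums the contribution of every source particle j;
    // the softening term eps2 avoids the singularity when two particles coincide.
    void nbody_direct(std::size_t n,
                      const float* x, const float* y, const float* z,
                      const float* m,
                      float* ax, float* ay, float* az,
                      float eps2) {
        for (std::size_t i = 0; i < n; ++i) {          // outer loop: targets
            float axi = 0.0f, ayi = 0.0f, azi = 0.0f;
            for (std::size_t j = 0; j < n; ++j) {      // inner loop: sources
                const float dx = x[j] - x[i];
                const float dy = y[j] - y[i];
                const float dz = z[j] - z[i];
                const float r2 = dx * dx + dy * dy + dz * dz + eps2;
                const float inv_r = 1.0f / std::sqrt(r2);
                const float w = m[j] * inv_r * inv_r * inv_r;   // m_j / r^3
                axi += w * dx;
                ayi += w * dy;
                azi += w * dz;
            }
            ax[i] = axi; ay[i] = ayi; az[i] = azi;
        }
    }

The double loop performs O(N²) work on O(N) data, which is exactly why this kernel maps so well onto accelerators; the FMM trades some of that arithmetic for algorithmic efficiency.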

The exaFMM library handles the complexity involved in correctly and efficiently implementing the Fast Multipole Method, which is a decidedly non-trivial algorithm. Even the most ardent “I’ll write my own code” advocate will be happy to use exaFMM after delving into the myriad details required for a full and computationally efficient FMM implementation. More precisely, Lorena Barba at Boston University notes:

Multipole expansions are used to represent the effect of a cluster of particles, and are built for parent cells by an easy transformation (M2M). Local expansions are used to evaluate the effect of multipole expansions on many target points locally. The multipole-to-local (M2L) transformation is used only in the FMM, while treecodes compute the effect of multipole expansions directly on target points, with O(N log N) scaling for N particles. The FMM has the ideal O(N) scaling.

(Slide: FMM overview graphic)

While collaborating with Rio during my recent visit to KAUST, we found that using aligned vectors was key to getting much better performance than is reported below. The change was simple: switching from malloc() to the _mm_malloc() aligned memory allocation call (a sketch of this change appears at the end of this post). Prior to my visit, performance of the P2P (particle-to-particle) kernel was around 800 GF/s; switching to _mm_malloc() bumped performance by about 200 GF/s. Rio now reports that additional work has pushed Intel Xeon Phi performance past 1 TF/s per device!

(Slide: P2P kernel performance)

Following are some of the 2013 overall exaFMM performance numbers.

(Slides: single Intel Xeon Phi performance; cross-platform comparison)

The exaFMM library is available on a variety of leadership-class supercomputers around the world.

(Slide: systems where exaFMM runs)

However, if you need to calculate long-range forces between large numbers of particles on a small cluster or even a single device, download exaFMM and give it a try.
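For readers curious about the aligned-allocation change mentioned above, here is a minimal sketch. The 64-byte alignment matches the 512-bit vectors of the Knights Corner coprocessor; the array name and size are illustrative and not taken from the exaFMM source:

    #include <mm_malloc.h>   // _mm_malloc / _mm_free (provided by icc and gcc)
    #include <cstddef>

    int main() {
        const std::size_t n = 1 << 20;

        // Before: plain malloc() gives no particular alignment guarantee.
        // float* pos = static_cast<float*>(malloc(n * sizeof(float)));

        // After: 64-byte aligned allocation, so 512-bit vector loads and stores
        // on Intel Xeon Phi can be issued as aligned accesses.
        float* pos = static_cast<float*>(_mm_malloc(n * sizeof(float), 64));
        if (!pos) return 1;

        for (std::size_t i = 0; i < n; ++i)
            pos[i] = 0.0f;

        _mm_free(pos);       // aligned allocations must be released with _mm_free
        return 0;
    }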
