
SC14 – Fast Hybrid GPU Betweenness Centrality Code Achieves Nearly Ideal Scaling to 192 GPUs

August 8, 2014 by Rob Farber

Don’t miss the SC14 presentation on Wednesday, Nov. 19 in room 388-89-90, where McLaughlin and Bader will present their paper “Scalable and High Performance Betweenness Centrality on the GPU”. The authors report nearly ideal scaling to 192 GPUs and traversal rates measured in billions of traversed edges per second (GTEPS). The paper can be downloaded here and their software can be downloaded from github.

Over the past few years, betweenness centrality has become a popular and heavily used tool for analyzing complex networks, including social networks, computer networks, biological networks, transport systems, scientific collaboration networks and many others. A memory-transaction-intensive algorithm, betweenness centrality measures a node’s importance in a network by counting the shortest paths between all pairs of vertices that pass through that node. Computational approaches that speed and scale betweenness centrality calculations benefit not only users but computer scientists as well, because new approaches can potentially be adapted to accelerate other graph algorithms. The McLaughlin and Bader method appears to deliver near-ideal scalability and very high performance.
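
For reference, the standard Freeman/Brandes definition (not something specific to this paper) sums, for every source-sink pair, the fraction of shortest paths that pass through the vertex in question:

    BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}

where \sigma_{st} is the number of shortest paths from s to t and \sigma_{st}(v) is the number of those paths that pass through v.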

The McLaughlin and Bader abstract notes:

Prior GPU implementations suffer from large local data structures and inefficient graph traversals that limit scalability and performance. Here we present several hybrid GPU implementations, providing good performance on graphs of arbitrary structure rather than just scale-free graphs as was done previously. We achieve up to 13x speedup on high-diameter graphs and an average of 2.71x speedup overall over the best existing GPU algorithm. We observe near linear speedup and performance exceeding tens of GTEPS when running betweenness centrality on 192 GPUs.

The full source code can be downloaded from github.

Motivation

Massively parallel implementations of graph algorithms on affordable latency-hiding devices like GPUs and Intel Xeon Phi have created a unique opportunity for students and companies to affect how business is done throughout the computer industry. The timing is right, as graph-based computational techniques are a focus of research funding by governments and by cash-rich companies working in big data, social media, image recognition, network analysis and numerous other areas. The memory systems of GPUs (and Intel Xeon Phi) are of particular interest because they are designed to support massively parallel computation by hiding memory latency across hundreds to thousands of concurrent threads of execution. This is the key hardware characteristic to exploit, because graph algorithms are memory-bound rather than computation-limited. Programmers who can devise scalable, extensible, and high-performance solutions for generic graph algorithms therefore hold a golden key to advancing their careers and succeeding in our ever more data-rich world.

Graph problems are notorious for being limited by memory latency rather than by memory bandwidth. However, the same latency-hiding technology that sustains high flop rates in the massively parallel environment inside a GPU also hides memory latency for graph-based computations. As a result, these devices are surprisingly capable graph processing engines.
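
To make that concrete, below is a minimal sketch of a single vertex-parallel BFS frontier expansion over a graph stored in CSR (compressed sparse row) format. It is an illustrative assumption, not code from the McLaughlin and Bader implementation; the kernel name and parameters are invented for the example. The irregular, data-dependent loads of row_offsets[] and column_indices[] are exactly the accesses that thousands of concurrent threads help to hide.

    // Vertex-parallel expansion of one BFS level (illustrative sketch only).
    // depths[v] == -1 means unvisited; vertices at current_depth expand their
    // adjacency lists and claim unvisited neighbors for the next level.
    __global__ void bfs_frontier_expand(const int *row_offsets,
                                        const int *column_indices,
                                        int *depths,
                                        int current_depth,
                                        int num_vertices,
                                        int *frontier_size)
    {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= num_vertices || depths[v] != current_depth)
            return;                               // not on the current frontier

        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
            int w = column_indices[e];            // data-dependent, irregular load
            // atomicCAS claims an unvisited neighbor exactly once
            if (atomicCAS(&depths[w], -1, current_depth + 1) == -1)
                atomicAdd(frontier_size, 1);      // count newly discovered vertices
        }
    }

The point of the sketch is the memory behavior: each thread issues loads whose addresses depend on earlier loads, and the GPU keeps the memory system busy by switching among the many threads that are waiting.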

For more information:

  • http://devblogs.nvidia.com/parallelforall/accelerating-graph-betweenness-centrality-cuda/
  • “Large-Scale Graph Processing Algorithms on the GPU” (a literature review of challenges for graph processing on GPUs)
  • The well-known paper by Duane Merrill, Michael Garland, and Andrew Grimshaw, “High Performance and Scalable GPU Graph Traversal”, was one of the first to investigate GPU graph traversal. The source code is available on Google Code as back40computing. The paper’s abstract notes:

Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
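
As a companion to the kernel sketch earlier, here is a hedged host-side driver for a level-synchronous BFS. It is an illustrative assumption, not code from either paper, and it is deliberately much simpler than the fine-grained task management Merrill et al. describe; still, it shows the structure behind the O(|V|+|E|) bound: each vertex is expanded once and each edge is examined once, although the naive per-level scan over all vertices is exactly the kind of inefficiency the paper engineers away.

    // Level-synchronous BFS driver (illustrative sketch, not from either paper).
    // Launches the bfs_frontier_expand kernel above once per BFS level and
    // stops when no new vertices are discovered.
    void bfs(const int *d_row_offsets, const int *d_column_indices,
             int *d_depths, int num_vertices, int source)
    {
        // Mark every vertex unvisited (-1), then place the source at depth 0.
        cudaMemset(d_depths, 0xFF, num_vertices * sizeof(int));
        int zero = 0;
        cudaMemcpy(&d_depths[source], &zero, sizeof(int), cudaMemcpyHostToDevice);

        int *d_frontier_size;
        cudaMalloc(&d_frontier_size, sizeof(int));

        const int threads = 256;
        const int blocks  = (num_vertices + threads - 1) / threads;

        for (int depth = 0, frontier = 1; frontier > 0; ++depth) {
            cudaMemset(d_frontier_size, 0, sizeof(int));
            bfs_frontier_expand<<<blocks, threads>>>(d_row_offsets, d_column_indices,
                                                     d_depths, depth, num_vertices,
                                                     d_frontier_size);
            cudaMemcpy(&frontier, d_frontier_size, sizeof(int),
                       cudaMemcpyDeviceToHost);   // vertices added this level
        }
        cudaFree(d_frontier_size);
    }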

  • Other GPU graph implementations
    • MapGraph
      • “MapGraph: A High Level API for Fast Development of High-Performance Graph Analytics on GPUs”
    • Gunrock
      • NVIDIA GTC 2014: Yangzihao Wang, “High-Performance Graph Primitives on GPU: Design and Implementation of Gunrock” [video][pdf]
    • VertexAPI2
      • NVIDIA GTC 2014: “Speeding Up GraphLab” [video]
    • Breadth-First Graph Search Uses 2D Domain Decomposition – 400 GTEPS on 4096 GPUs
  • Additional (not necessarily GPU based) graph implementations of interest:
    • Graphlab
    • GraphChi

