GSIC International Workshop on GPGPU Applications

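Organizers: Global Scientific Information and Computing Center (GSIC), GPU Computing Study Group
Co-sponsors: GCOE "Computationism as a Foundation for the Sciences" (計算世界観の深化と展開), CREST ULP-HPC
Venue: Conference room, 2nd floor, Information Building, Global Scientific Information and Computing Center (campus map)
Purpose: As awareness of the high performance of GPUs spreads, academic and industrial use of GPGPU is being considered in a wide range of fields. This workshop aims to identify current challenges and future directions through discussion of advanced research on the fundamental characteristics of GPU computing and its applications.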

(Keynote) Prof. Lorena Barba (Boston University)

PetFMM--A dynamically load-balancing parallel fast multipole library

The fast multipole method (FMM) is a complex algorithm, and the programming difficulty associated with it has arguably diminished its impact by creating a barrier to adoption. We have developed a library for N-body interactions utilizing the FMM algorithm, built as part of the PETSc framework. A prominent feature of this library is that it is designed to be extensible, with a view to unifying efforts involving many algorithms based on the same principles as the FMM and enabling easy development of scientific application codes. The parallel PetFMM relies on a model including both work and communication estimates, which is used to provide dynamic load balancing. We are currently working on making PetFMM a heterogeneous application with the capacity to exploit GPU acceleration. At present, the most time-consuming operation (namely, the translation of multipole to local expansions) runs on CUDA at almost 500 gigaflops on one Tesla card.


Rio Yokota (Bristol University)

Viewing the Fast Multipole Method as a Fast Poisson Solver

As computer architectures become increasingly parallel, it is worth reconsidering alternative numerical methods in light of the parallelism they offer. The fast multipole method (FMM) is an interesting alternative to conventional Poisson solvers in this sense. It has been shown that the FMM can extract the full potential of massively parallel architectures such as large GPU clusters. The talk will focus on the future prospects of the relative performance of FMMs against fast Poisson solvers.


Marlon Arce Acuna (Tokyo Institute of Technology)

Multi-GPU Computing and Scalability for Real-Time Tsunami Simulation

The introduction of GPGPU has opened a new era in high-performance computing: the power of GPUs can now be used to solve computationally demanding problems. Tsunami simulation is one such problem. To model a tsunami and issue an accurate early warning, the shallow water equations have to be solved in real time. A single GPU alone computes 62 times faster than one CPU core. In addition, multi-GPU computing based on domain decomposition is studied, in which communication between GPUs is hidden by overlapping it with computation. For a very large dataset of a 4096x8192 mesh with 90 m resolution, the GPU tsunami simulation finished within 3 minutes using 8 GPUs. Excellent scalability has been achieved on the TSUBAME GPU cluster.
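
For reference, a common textbook statement of the two-dimensional shallow water equations in conservative form is sketched below. This is an assumption about the general form only: the abstract does not say which formulation, source terms (for example bottom friction), or discretization the talk uses. Here h is the water depth, (u, v) the depth-averaged velocity, g the gravitational acceleration, and z the bed elevation; all of these symbols are introduced for illustration.

```latex
\begin{aligned}
\frac{\partial h}{\partial t}
  + \frac{\partial (hu)}{\partial x}
  + \frac{\partial (hv)}{\partial y} &= 0, \\
\frac{\partial (hu)}{\partial t}
  + \frac{\partial}{\partial x}\!\left(hu^{2} + \tfrac{1}{2} g h^{2}\right)
  + \frac{\partial (huv)}{\partial y} &= -\,g h \frac{\partial z}{\partial x}, \\
\frac{\partial (hv)}{\partial t}
  + \frac{\partial (huv)}{\partial x}
  + \frac{\partial}{\partial y}\!\left(hv^{2} + \tfrac{1}{2} g h^{2}\right) &= -\,g h \frac{\partial z}{\partial y}.
\end{aligned}
```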


Ali Cevahir (Tokyo Institute of Technology)

Parallel Conjugate Gradient Solver on Multi-GPU Clusters

We explain a scalable implementation of a CG solver for unstructured matrices on a cluster, where each cluster node has multiple GPUs. To achieve scalability, we extend hypergraph-partitioning-based matrix decomposition models. Each GPU automatically selects the fastest running matrix-vector multiplication kernel. As a result, 94 Gflops double-precision CG performance is achieved on 32 nodes with 64 GPUs.
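
As background for the abstract above, the following is a minimal single-process sketch of the textbook conjugate gradient iteration with a sparse matrix-vector product in CSR format. It is not the speaker's implementation: it involves no GPUs, no hypergraph partitioning, and no kernel selection, and the function names (`csr_matvec`, `conjugate_gradient`) and the 3x3 test matrix are hypothetical choices made for illustration.

```python
import math


def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A x, with A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(values, col_idx, row_ptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite CSR matrix A."""
    x = [0.0] * len(b)
    r = list(b)              # residual r = b - A x, with initial guess x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rs_old) < tol:
            break            # residual norm is small enough
        Ap = csr_matvec(values, col_idx, row_ptr, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Hypothetical 3x3 SPD test matrix: [[4, 1, 0], [1, 3, 1], [0, 1, 2]]
values = [4.0, 1.0, 1.0, 3.0, 1.0, 1.0, 2.0]
col_idx = [0, 1, 0, 1, 2, 1, 2]
row_ptr = [0, 2, 5, 7]
b = [1.0, 2.0, 3.0]

x = conjugate_gradient(values, col_idx, row_ptr, b)
print(x)
```

Each iteration is dominated by one sparse matrix-vector product, which is why the choice of matrix-vector multiplication kernel mentioned in the abstract matters for overall performance.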


Kenta Sugihara (Tokyo Institute of Technology)

Performance of Higher-Order Advection Schemes on a Multi-Node GPU Cluster

Highly accurate advection calculation is an important part of CFD, needed to resolve velocity and density profiles. In this study, we evaluate the performance of higher-order advection schemes on a multi-node GPU cluster. A three-dimensional domain decomposition is adopted for multi-node GPU parallelization using the MPI library. Using 32 GPUs (Tesla S1070 on TSUBAME), a performance of 3.7 TFlops is obtained, about 3000 times faster than serial CPU execution. To gain further performance, an overlapping technique (GPU kernel, device-host transfer, and MPI communication) based on asynchronous execution is adopted, and 5.3 TFlops is obtained (a 1.4-times speed-up). This presentation covers the multi-GPU performance study and the overlapping technique.
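
The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the variable names are illustrative, and the derived per-GPU and serial-CPU rates are simple divisions rather than measured values.

```python
# Figures quoted in the abstract
n_gpus = 32
tflops_plain = 3.7        # without overlapping
tflops_overlap = 5.3      # with overlapping
cpu_ratio = 3000          # quoted ratio of the 3.7 TFlops result to serial CPU

# Speed-up gained from overlapping computation and communication
speedup = tflops_overlap / tflops_plain

# Derived average throughput per GPU, in GFlops
per_gpu_plain = tflops_plain * 1000 / n_gpus
per_gpu_overlap = tflops_overlap * 1000 / n_gpus

# Serial CPU rate implied by the quoted ratio, in GFlops
implied_cpu_gflops = tflops_plain * 1000 / cpu_ratio

print(f"speed-up from overlapping: {speedup:.2f}x")
print(f"per-GPU rate without overlap: {per_gpu_plain:.0f} GFlops")
print(f"per-GPU rate with overlap: {per_gpu_overlap:.0f} GFlops")
print(f"implied serial CPU rate: {implied_cpu_gflops:.2f} GFlops")
```

The ratio 5.3 / 3.7 is roughly 1.43, consistent with the 1.4-times speed-up stated in the abstract.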

17:00-17:30 TSUBAME Tour
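Contact: Takayuki Aoki, Global Scientific Information and Computing Center
Global Scientific Information and Computing Center, GPU Computing Study Group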
mailto: gpu-computing-office _at_