V. Volkov (University of California, Berkeley)

Experience in accelerating numerical routines using GPUs

I discuss some common topics in GPU code optimization. These include (i) bandwidth optimizations, such as reducing memory traffic, improving spatial locality and caching data in registers; (ii) moderately increasing the flop count in order to gain a substantial increase in the flop rate; and (iii) moving computation from the GPU to the CPU and back to get the best of both processors. I illustrate these concepts by applying them to some widely used numerical routines, such as matrix multiply, FFT, finite-difference stencils and linear algebra solvers.