Optimization techniques for GPU programming

P Hijma, S Heldens, A Sclocco… - ACM Computing …, 2023 - dl.acm.org
In the past decade, Graphics Processing Units have played an important role in the field of
high-performance computing and they still advance new fields such as IoT, autonomous …

Optimizing CUDA code by kernel fusion: application on BLAS

J Filipovič, M Madzin, J Fousek, L Matyska - The Journal of …, 2015 - Springer
Contemporary GPUs have significantly higher arithmetic throughput than memory
throughput. Hence, many GPU kernels are memory bound and cannot exploit arithmetic …

LogCA: A high-level performance model for hardware accelerators

MSB Altaf, DA Wood - ACM SIGARCH Computer Architecture News, 2017 - dl.acm.org
With the end of Dennard scaling, architects have increasingly turned to special-purpose
hardware accelerators to improve the performance and energy efficiency for some …

GHOST: building blocks for high performance sparse linear algebra on heterogeneous systems

M Kreutzer, J Thies, M Röhrig-Zöllner, A Pieper… - International Journal of …, 2017 - Springer
While many of the architectural details of future exascale-class high performance computer
systems are still a matter of intense research, there appears to be a general consensus that …

Pipelined approach to fused kernels for optimization of machine learning workloads on graphical processing units

A Ashari, M Boehm, KW Campbell… - US Patent …, 2018 - Google Patents
A method for optimization of machine learning (ML) workloads on a
graphics processor unit (GPU). The method includes identifying a computation having a …

Systematic fusion of CUDA kernels for iterative sparse linear system solvers

JI Aliaga, J Pérez, ES Quintana-Ortí - European Conference on Parallel …, 2015 - Springer
We introduce a systematic analysis in order to fuse CUDA kernels arising in efficient iterative
methods for the solution of sparse linear systems. Our procedure characterizes the input and …

Real-time optical flow processing on embedded GPU: an hardware-aware algorithm to implementation strategy

M Seznec, N Gac, F Orieux, AS Naik - Journal of Real-Time Image …, 2022 - Springer
Determining the optical flow of a video is a compute-intensive task essential for computer
vision. For achieving this processing in real time, the whole algorithm deployment chain …

Accelerating the task/data-parallel version of ILUPACK's BiCG in multi-CPU/GPU configurations

JI Aliaga, E Dufrechou, P Ezzatti, ES Quintana-Ortí - Parallel Computing, 2019 - Elsevier
ILUPACK is a valuable tool for the solution of sparse linear systems via iterative Krylov
subspace-based methods. Its relevance for the solution of real problems has motivated …

Time-domain simulation of large electric power systems using domain-decomposition and parallel processing methods

P Aristidou - 2015 - search.proquest.com
Dynamic simulation studies are used to analyze the behavior of power systems after a
disturbance has occurred. Over the last decades, they have become indispensable to …

Optimization and performance evaluation of the IDR iterative Krylov solver on GPUs

H Anzt, M Kreutzer, E Ponce… - … Journal of High …, 2018 - journals.sagepub.com
In this paper, we present an optimized GPU implementation for the induced dimension
reduction algorithm. We improve data locality, combine it with an efficient sparse matrix …