Effective extensible programming: unleashing Julia on GPUs

T Besard, C Foket, B De Sutter - IEEE Transactions on Parallel …, 2018 - ieeexplore.ieee.org
GPUs and other accelerators are popular devices for accelerating compute-intensive,
parallelizable applications. However, programming these devices is a difficult task. Writing …

Reverse-mode automatic differentiation and optimization of GPU kernels via Enzyme

WS Moses, V Churavy, L Paehler… - Proceedings of the …, 2021 - dl.acm.org
Computing derivatives is key to many algorithms in scientific computing and machine
learning such as optimization, uncertainty quantification, and stability analysis. Enzyme is a …

SkePU 2: Flexible and type-safe skeleton programming for heterogeneous parallel systems

A Ernstsson, L Li, C Kessler - International Journal of Parallel …, 2018 - Springer
In this article we present SkePU 2, the next generation of the SkePU C++ skeleton
programming framework for heterogeneous parallel systems. We critically examine the …

Register optimizations for stencils on GPUs

PS Rawat, F Rastello, A Sukumaran-Rajam… - Proceedings of the 23rd …, 2018 - dl.acm.org
The recent advent of compute-intensive GPU architecture has allowed application
developers to explore high-order 3D stencils for better computational accuracy. A common …

Understanding the GPU microarchitecture to achieve bare-metal performance tuning

X Zhang, G Tan, S Xue, J Li, K Zhou… - Proceedings of the 22nd …, 2017 - dl.acm.org
In this paper, we present a methodology to understand GPU microarchitectural features and
improve performance for compute-intensive kernels. The methodology relies on a reverse …

CUDAAdvisor: LLVM-based runtime profiling for modern GPUs

D Shen, SL Song, A Li, X Liu - … of the 2018 International Symposium on …, 2018 - dl.acm.org
General-purpose GPUs have been widely utilized to accelerate parallel applications. Given
a relatively complex programming model and fast architecture evolution, producing efficient …

The missing pieces of open design enablement: A recent history of Google efforts

T Ansell, M Saligane - Proceedings of the 39th International Conference …, 2020 - dl.acm.org
In an initiative to advance the open-source electronic design automation (EDA) and
hardware design community, Google has been spearheading a global collaborative effort …

Optimization of flexible neighbors lists in Smoothed Particle Hydrodynamics on GPU

G Bilotta, V Zago, A Hérault, A Cappello… - … in Engineering Software, 2024 - Elsevier
Recent refactoring of the GPUSPH codebase has uncovered some of the limitations of the
official CUDA compiler (nvcc) offered by NVIDIA when dealing with some C++ constructs …

Guardian: Safe GPU Sharing in Multi-Tenant Environments

M Pavlidakis, G Vasiliadis, S Mavridis… - Proceedings of the 25th …, 2024 - dl.acm.org
Modern GPU applications, such as machine learning (ML), can only partially utilize GPUs,
leading to GPU underutilization in cloud environments. Sharing GPUs across multiple …

CUDA Flux: A lightweight instruction profiler for CUDA applications

L Braun, H Fröning - 2019 IEEE/ACM Performance Modeling …, 2019 - ieeexplore.ieee.org
GPUs are powerful, massively parallel processors, which require a vast amount of thread
parallelism to keep their thousands of execution units busy, and to tolerate latency when …