Effective extensible programming: unleashing Julia on GPUs
GPUs and other accelerators are popular devices for accelerating compute-intensive,
parallelizable applications. However, programming these devices is a difficult task. Writing …
Reverse-mode automatic differentiation and optimization of GPU kernels via Enzyme
Computing derivatives is key to many algorithms in scientific computing and machine
learning such as optimization, uncertainty quantification, and stability analysis. Enzyme is a …
SkePU 2: Flexible and type-safe skeleton programming for heterogeneous parallel systems
In this article we present SkePU 2, the next generation of the SkePU C++ skeleton
programming framework for heterogeneous parallel systems. We critically examine the …
Register optimizations for stencils on GPUs
The recent advent of compute-intensive GPU architecture has allowed application
developers to explore high-order 3D stencils for better computational accuracy. A common …
Understanding the GPU microarchitecture to achieve bare-metal performance tuning
In this paper, we present a methodology to understand GPU microarchitectural features and
improve performance for compute-intensive kernels. The methodology relies on a reverse …
CUDAAdvisor: LLVM-based runtime profiling for modern GPUs
General-purpose GPUs have been widely utilized to accelerate parallel applications. Given
a relatively complex programming model and fast architecture evolution, producing efficient …
The missing pieces of open design enablement: A recent history of Google efforts
In an initiative to advance the open-source electronic design automation (EDA) and
hardware design community, Google has been spearheading a global collaborative effort …
Optimization of flexible neighbors lists in Smoothed Particle Hydrodynamics on GPU
Recent refactoring of the GPUSPH codebase has uncovered some of the limitations of the
official CUDA compiler (nvcc) offered by NVIDIA when dealing with some C++ constructs …
Guardian: Safe GPU Sharing in Multi-Tenant Environments
Modern GPU applications, such as machine learning (ML), can only partially utilize GPUs,
leading to GPU underutilization in cloud environments. Sharing GPUs across multiple …
CUDA Flux: A lightweight instruction profiler for CUDA applications
GPUs are powerful, massively parallel processors, which require a vast amount of thread
parallelism to keep their thousands of execution units busy, and to tolerate latency when …