Optimization techniques for GPU programming
In the past decade, Graphics Processing Units have played an important role in the field of
high-performance computing and they still advance new fields such as IoT, autonomous …
A Comprehensive Survey of Benchmarks for Improvement of Software's Non-Functional Properties
Despite recent increase in research on improvement of non-functional properties of
software, such as energy usage or program size, there is a lack of standard benchmarks for …
Futhark: purely functional GPU-programming with nested parallelism and in-place array updates
Futhark is a purely functional data-parallel array language that offers a machine-neutral
programming model and an optimising compiler that generates OpenCL code for GPUs …
A comprehensive performance comparison of CUDA and OpenCL
This paper presents a comprehensive performance comparison between CUDA and
OpenCL. We have selected 16 benchmarks ranging from synthetic applications to real-world …
A performance analysis framework for identifying potential benefits in GPGPU applications
Tuning code for GPGPU and other emerging many-core platforms is a challenge because
few models or tools can precisely pinpoint the root cause of performance bottlenecks. In this …
Reducing branch divergence in GPU programs
TD Han, TS Abdelrahman - Proceedings of the fourth workshop on …, 2011 - dl.acm.org
Branch divergence has a significant impact on the performance of GPU programs. We
propose two novel software-based optimizations, called iteration delaying and branch …
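A toy model can make the cost of branch divergence concrete. The sketch below is not the iteration-delaying or branch-distribution technique from the paper above. It is a simplified, assumed cost model in which a warp of 32 threads runs one serialized pass for each distinct branch path its threads take. The function names `warp_passes` and `total_passes` are hypothetical. The model compares interleaved branch outcomes with the same outcomes grouped by path. Grouping like this is a generic data-reordering idea, not the paper's method.

```python
"""Toy SIMT cost model: how branch divergence serializes a warp (illustrative only)."""

WARP_SIZE = 32  # NVIDIA warp width; other GPUs may use a different width (e.g. 64)


def warp_passes(branch_ids):
    """Serialized passes one warp needs: one per distinct branch path taken.

    Simplifying assumption: every distinct outcome costs one full pass of
    the warp, with the lanes that did not take that path masked off.
    """
    return len(set(branch_ids))


def total_passes(outcomes, warp_size=WARP_SIZE):
    """Sum the serialized passes over all warps of a 1-D launch."""
    return sum(
        warp_passes(outcomes[i:i + warp_size])
        for i in range(0, len(outcomes), warp_size)
    )


if __name__ == "__main__":
    n = 1024  # 32 warps of 32 threads
    # Alternating outcomes: every warp sees both paths, so 2 passes per warp.
    interleaved = [i % 2 for i in range(n)]
    # The same outcomes grouped by path: every warp is uniform, so 1 pass each.
    grouped = sorted(interleaved)
    print(total_passes(interleaved))  # 64
    print(total_passes(grouped))      # 32
```

With the same total work, the grouped layout needs half the warp passes of the interleaved one. Real hardware costs also depend on the length of each path and on reconvergence behavior, which this model ignores.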
Optimizing memory efficiency for deep convolutional neural networks on GPUs
Leveraging large data sets, deep Convolutional Neural Networks (CNNs) achieve state-of-
the-art recognition accuracy. Due to the substantial compute and memory operations …
On-the-fly elimination of dynamic irregularities for GPU computing
The power-efficient massively parallel Graphics Processing Units (GPUs) have become
increasingly influential for general-purpose computing over the past few years. However …
Many-thread aware prefetching mechanisms for GPGPU applications
We consider the problem of how to improve memory latency tolerance in massively
multithreaded GPGPUs when the thread-level parallelism of an application is not sufficient to …
Characterizing and improving the use of demand-fetched caches in GPUs
Initially introduced as special-purpose accelerators for games and graphics code, graphics
processing units (GPUs) have emerged as widely-used high-performance parallel …