Optimization techniques for GPU programming

P Hijma, S Heldens, A Sclocco… - ACM Computing …, 2023 - dl.acm.org
In the past decade, Graphics Processing Units have played an important role in the field of
high-performance computing and they still advance new fields such as IoT, autonomous …

A Comprehensive Survey of Benchmarks for Improvement of Software's Non-Functional Properties

A Blot, J Petke - ACM Computing Surveys, 2025 - dl.acm.org
Despite recent increase in research on improvement of non-functional properties of
software, such as energy usage or program size, there is a lack of standard benchmarks for …

Futhark: purely functional GPU-programming with nested parallelism and in-place array updates

T Henriksen, NGW Serup, M Elsman… - Proceedings of the 38th …, 2017 - dl.acm.org
Futhark is a purely functional data-parallel array language that offers a machine-neutral
programming model and an optimising compiler that generates OpenCL code for GPUs …

A comprehensive performance comparison of CUDA and OpenCL

J Fang, AL Varbanescu, H Sips - … International Conference on …, 2011 - ieeexplore.ieee.org
This paper presents a comprehensive performance comparison between CUDA and
OpenCL. We have selected 16 benchmarks ranging from synthetic applications to real-world …

A performance analysis framework for identifying potential benefits in GPGPU applications

J Sim, A Dasgupta, H Kim, R Vuduc - Proceedings of the 17th ACM …, 2012 - dl.acm.org
Tuning code for GPGPU and other emerging many-core platforms is a challenge because
few models or tools can precisely pinpoint the root cause of performance bottlenecks. In this …

Reducing branch divergence in GPU programs

TD Han, TS Abdelrahman - Proceedings of the fourth workshop on …, 2011 - dl.acm.org
Branch divergence has a significant impact on the performance of GPU programs. We
propose two novel software-based optimizations, called iteration delaying and branch …

Optimizing memory efficiency for deep convolutional neural networks on GPUs

C Li, Y Yang, M Feng, S Chakradhar… - SC'16: Proceedings of …, 2016 - ieeexplore.ieee.org
Leveraging large data sets, deep Convolutional Neural Networks (CNNs) achieve
state-of-the-art recognition accuracy. Due to the substantial compute and memory operations …

On-the-fly elimination of dynamic irregularities for GPU computing

EZ Zhang, Y Jiang, Z Guo, K Tian, X Shen - ACM SIGPLAN Notices, 2011 - dl.acm.org
The power-efficient massively parallel Graphics Processing Units (GPUs) have become
increasingly influential for general-purpose computing over the past few years. However …

Many-thread aware prefetching mechanisms for GPGPU applications

J Lee, NB Lakshminarayana, H Kim… - 2010 43rd Annual IEEE …, 2010 - ieeexplore.ieee.org
We consider the problem of how to improve memory latency tolerance in massively
multithreaded GPGPUs when the thread-level parallelism of an application is not sufficient to …

Characterizing and improving the use of demand-fetched caches in GPUs

W Jia, KA Shaw, M Martonosi - … of the 26th ACM international conference …, 2012 - dl.acm.org
Initially introduced as special-purpose accelerators for games and graphics code, graphics
processing units (GPUs) have emerged as widely-used high-performance parallel …