NAS Parallel Benchmarks with CUDA and beyond

G Araujo, D Griebler, DA Rockenbach… - Software: Practice …, 2023 - Wiley Online Library
Abstract NAS Parallel Benchmarks (NPB) is a standard benchmark suite used in the
evaluation of parallel hardware and software. Several research efforts from academia have …

Efficient NAS parallel benchmark kernels with CUDA

GA de Araujo, D Griebler, M Danelutto… - 2020 28th Euromicro …, 2020 - ieeexplore.ieee.org
NAS Parallel Benchmarks (NPB) are one of the standard benchmark suites used to evaluate
parallel hardware and software. There are many research efforts trying to provide different …

Optimizing GPU register usage: Extensions to OpenACC and compiler optimizations

X Tian, D Khaldi, D Eachempati, R Xu… - 2016 45th …, 2016 - ieeexplore.ieee.org
Using compiler directives to program accelerator-based systems through APIs such as
OpenACC or OpenMP has increasingly gained popularity due to the portability and …

Automatically exploiting the memory hierarchy of GPUs through just-in-time compilation

M Papadimitriou, J Fumero, A Stratikopoulos… - Proceedings of the 17th …, 2021 - dl.acm.org
Although Graphics Processing Units (GPUs) have become pervasive for data-parallel
workloads, the efficient exploitation of their tiered memory hierarchy requires explicit …

Exploring OpenMP GPU Offloading for Implementing Convolutional Neural Networks

K Yan, Y Shi, Y Yan - Proceedings of the 14th International Workshop on …, 2023 - dl.acm.org
Computing on heterogeneous architecture involving CPUs and accelerators is now a
popular choice of parallel computing. As a directive-based programming model, OpenMP …

Optimizing the Performance of Directive-based Programming Model for GPGPUs

R Xu - 2016 - uh-ir.tdl.org
Accelerators have been deployed on most major HPC systems. They are considered to
improve the performance of many applications. Accelerators such as GPUs have an …

the th International Workshop on Programming Models and Applications for Multicores and Manycores

ACM SIGPLAN, ACM SIGHPC - dl.acm.org
Matrix computations are widely used in increasing sizes and complexity in scientific
computing and engineering. But current matrix language implementations lack programmer …

Optimizing Apple's lossless audio codec algorithm using NVIDIA CUDA

R Ahmed, MS Islam - 2016 - dspace.bracu.ac.bd
As the majority of compression algorithms are implemented for CPU architectures, the
primary focus of our work is to exploit the opportunities of GPU parallelism in audio …

An open-source solution to performance portability for Summit and Sierra supercomputers

GT Bercea, A Bataev, AE Eichenberger… - IBM Journal of …, 2019 - ieeexplore.ieee.org
Programming models that use a higher level of abstraction to express parallelism can target
both CPUs and any attached devices, alleviating the maintainability and portability concerns …

Implementação CUDA dos Kernels NPB

GA de Araújo, D Griebler… - Anais da 20a Escola …, 2020 - repositorio.pucrs.br
NAS Parallel Benchmarks (NPB) is a set of benchmarks used to evaluate
hardware and software, which over the years has been ported to different frameworks …