Parallel programming models for heterogeneous many-cores: a comprehensive survey

J Fang, C Huang, T Tang, Z Wang - CCF Transactions on High …, 2020 - Springer
Heterogeneous many-cores are now an integral part of modern computing systems ranging
from embedded systems to supercomputers. While heterogeneous many-core design offers …

A survey on techniques for cooperative CPU-GPU computing

K Raju, NN Chiplunkar - Sustainable Computing: Informatics and Systems, 2018 - Elsevier
Graphical Processing Unit provides massive parallelism due to the presence of
hundreds of cores. Usage of GPUs for general purpose computation (GPGPU) has resulted …

A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling

E Konstantinidis, Y Cotronis - Journal of Parallel and Distributed Computing, 2017 - Elsevier
Typically, the execution time of a kernel on a GPU is a difficult-to-predict measure as it
depends on a wide range of factors. Performance can be limited by either memory transfer …

A practical performance model for compute and memory bound GPU kernels

E Konstantinidis, Y Cotronis - 2015 23rd Euromicro …, 2015 - ieeexplore.ieee.org
Performance prediction of GPU kernels is generally a tedious procedure with unpredictable
results. In this paper, we provide a practical model for estimating performance of CUDA …

Auto-tuning streamed applications on Intel Xeon Phi

P Zhang, J Fang, T Tang, C Yang… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
Many-core accelerators, as represented by the Xeon Phi coprocessors and GPGPUs, allow
software to exploit spatial and temporal sharing of computing resources to improve the …

OmniRPC: a Grid RPC system for parallel programming in cluster and Grid environment

M Sato, T Boku, D Takahashi - CCGrid 2003. 3rd IEEE/ACM …, 2003 - ieeexplore.ieee.org
We have designed and implemented a Grid RPC system called OmniRPC, for parallel
programming in cluster and grid environments. While OmniRPC inherits its API from Ninf, the …

Paralia: A performance aware runtime for auto-tuning linear algebra on heterogeneous systems

P Anastasiadis, N Papadopoulou, G Goumas… - ACM Transactions on …, 2023 - dl.acm.org
Dense linear algebra operations appear very frequently in high-performance computing
(HPC) applications, rendering their performance crucial to achieve optimal scalability. As …

A high-throughput DPI engine on GPU via algorithm/implementation co-optimization

CL Hsieh, L Vespa, N Weng - Journal of Parallel and Distributed …, 2016 - Elsevier
The Graphics Processing Unit (GPU) is a promising platform to implement Deep
Packet Inspection (DPI) due to the GPU's rich parallelism and programmability for high …

In-place transposition of rectangular matrices on accelerators

IJ Sung, J Gómez-Luna, JM González-Linares… - ACM SIGPLAN …, 2014 - dl.acm.org
Matrix transposition is an important algorithmic building block for many numeric algorithms
such as FFT. It has also been used to convert the storage layout of arrays. With more and …

Online speech recognition using multichannel parallel acoustic score computation and deep neural network (DNN)-based voice-activity detector

YR Oh, K Park, JG Park - Applied Sciences, 2020 - mdpi.com
This paper aims to design an online, low-latency, and high-performance speech recognition
system using a bidirectional long short-term memory (BLSTM) acoustic model. To achieve …