Google Scholar

Parallel programming models for heterogeneous many-cores: a comprehensive survey

J Fang, C Huang, T Tang, Z Wang - CCF Transactions on High …, 2020 - Springer

Heterogeneous many-cores are now an integral part of modern computing systems ranging
from embedding systems to supercomputers. While heterogeneous many-core design offers …

Cited by 62

A survey on techniques for cooperative CPU-GPU computing

K Raju, NN Chiplunkar - Sustainable Computing: Informatics and Systems, 2018 - Elsevier

Graphical Processing Unit provides massive parallelism due to the presence of
hundreds of cores. Usage of GPUs for general purpose computation (GPGPU) has resulted …

Cited by 43

A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling

E Konstantinidis, Y Cotronis - Journal of Parallel and Distributed Computing, 2017 - Elsevier

Typically, the execution time of a kernel on a GPU is a difficult to predict measure as it
depends on a wide range of factors. Performance can be limited by either memory transfer …

Cited by 88

A practical performance model for compute and memory bound GPU kernels

E Konstantinidis, Y Cotronis - 2015 23rd Euromicro …, 2015 - ieeexplore.ieee.org

Performance prediction of GPU kernels is generally a tedious procedure with unpredictable
results. In this paper, we provide a practical model for estimating performance of CUDA …

Cited by 46

Auto-tuning streamed applications on Intel Xeon Phi

P Zhang, J Fang, T Tang, C Yang… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org

Many-core accelerators, as represented by the XeonPhi coprocessors and GPGPUs, allow
software to exploit spatial and temporal sharing of computing resources to improve the …

Cited by 37

OmniRPC: a Grid RPC system for parallel programming in cluster and Grid environment

M Sato, T Boku, D Takahashi - CCGrid 2003. 3rd IEEE/ACM …, 2003 - ieeexplore.ieee.org

We have designed and implemented a Grid RPC system called OmniRPC, for parallel
programming in cluster and grid environments. While OmniRPC inherits its API from Ninf, the …

Cited by 109

Paralia: A performance aware runtime for auto-tuning linear algebra on heterogeneous systems

P Anastasiadis, N Papadopoulou, G Goumas… - ACM Transactions on …, 2023 - dl.acm.org

Dense linear algebra operations appear very frequently in high-performance computing
(HPC) applications, rendering their performance crucial to achieve optimal scalability. As …

Cited by 4

A high-throughput DPI engine on GPU via algorithm/implementation co-optimization

CL Hsieh, L Vespa, N Weng - Journal of Parallel and Distributed …, 2016 - Elsevier

The Graphics Processing Unit (GPU) is a promising platform to implement Deep
Packet Inspection (DPI) due to the GPU's rich parallelism and programmability for high …

Cited by 32

In-place transposition of rectangular matrices on accelerators

IJ Sung, J Gómez-Luna, JM González-Linares… - ACM SIGPLAN …, 2014 - dl.acm.org

Matrix transposition is an important algorithmic building block for many numeric algorithms
such as FFT. It has also been used to convert the storage layout of arrays. With more and …

Cited by 34

Online speech recognition using multichannel parallel acoustic score computation and deep neural network (DNN)-based voice-activity detector

YR Oh, K Park, JG Park - Applied Sciences, 2020 - mdpi.com

This paper aims to design an online, low-latency, and high-performance speech recognition
system using a bidirectional long short-term memory (BLSTM) acoustic model. To achieve …

Cited by 12

Improving GPU performance prediction with data transfer modeling

Parallel programming models for heterogeneous many-cores: a comprehensive survey