Google Scholar

A survey of techniques for architecting and managing GPU register file

S Mittal - IEEE Transactions on Parallel and Distributed …, 2016 - ieeexplore.ieee.org

To support their massively-multithreaded architecture, GPUs use very large register file (RF)
which has a capacity higher than even L1 and L2 caches. In total contrast, traditional CPUs …

Cited by 57 · All versions (9)

ConvStencil: Transform stencil computation to matrix multiplication on tensor cores

Y Chen, K Li, Y Wang, D Bai, L Wang, L Ma… - Proceedings of the 29th …, 2024 - dl.acm.org

Tensor Core Unit (TCU) is increasingly integrated into modern high-performance processors
to enhance matrix multiplication performance. However, constrained to its over-specification …

Cited by 6

Machine learning based auto-tuning for enhanced OpenCL performance portability

TL Falch, AC Elster - 2015 IEEE International Parallel and …, 2015 - ieeexplore.ieee.org

Heterogeneous computing, which combines devices with different architectures, is rising in
popularity, and promises increased performance combined with reduced energy …

Cited by 60 · All versions (6)

Warp-consolidation: A novel execution model for GPUs

A Li, W Liu, L Wang, K Barker, SL Song - Proceedings of the 2018 …, 2018 - dl.acm.org

With the unprecedented development of compute capability and extension of memory
bandwidth on modern GPUs, parallel communication and synchronization soon becomes a …

Cited by 36 · All versions (2)

Machine learning‐based auto‐tuning for enhanced performance portability of OpenCL applications

TL Falch, AC Elster - Concurrency and Computation: Practice …, 2017 - Wiley Online Library

Heterogeneous computing, combining devices with different architectures such as CPUs
and GPUs, is rising in popularity and promises increased performance combined with …

Cited by 41

GPU-UniCache: Automatic code generation of spatial blocking for stencils on GPUs

K Hou, H Wang, W Feng - Proceedings of the computing frontiers …, 2017 - dl.acm.org

Spatial blocking is a critical memory-access optimization to efficiently exploit the computing
resources of parallel processors, such as many-core GPUs. By reusing cache-loaded data …

Cited by 29 · All versions (5)

LoRAStencil: Low-Rank Adaptation of Stencil Computation on Tensor Cores

Y Zhang, K Li, L Yuan, J Cheng… - … Conference for High …, 2024 - ieeexplore.ieee.org

Stencil computations play a pivotal role in numerous scientific and industrial applications,
yet their efficient execution on specialized hardware accelerators like Tensor Core Units …

Cited by 1 · All versions (3)

Moirae: Generating High-Performance Composite Stencil Programs with Global Optimizations

X Liu, X Yang, K Ma, S Liu, K Zhang… - … Conference for High …, 2024 - ieeexplore.ieee.org

Stencil computation is one of the most universal computation motifs in scientific applications
such as weather prediction. Due to the complexity of scientific simulation, the stencil …

All versions (3)

Batched triangular dense linear algebra kernels for very small matrix sizes on GPUs

A Charara, D Keyes, H Ltaief - ACM Transactions on Mathematical …, 2019 - dl.acm.org

Batched dense linear algebra kernels are becoming ubiquitous in scientific applications,
ranging from tensor contractions in deep learning to data compression in hierarchical low …

Cited by 18 · All versions (7)

Memory access optimization of high-order CFD stencil computations on GPU

S Wang, Z Li, Y Che - … , PDCAT 2020, Shenzhen, China, December 28–30 …, 2021 - Springer

Stencils computations are a class of computations commonly found in scientific and
engineering applications. They have relatively lower arithmetic intensity. Therefore, their …

Cited by 5 · All versions (2)

Register caching for stencil computations on GPUs
