A survey of techniques for architecting and managing GPU register file
S Mittal - IEEE Transactions on Parallel and Distributed …, 2016 - ieeexplore.ieee.org
To support their massively-multithreaded architecture, GPUs use very large register file (RF)
which has a capacity higher than even L1 and L2 caches. In total contrast, traditional CPUs …
ConvStencil: Transform stencil computation to matrix multiplication on tensor cores
Tensor Core Unit (TCU) is increasingly integrated into modern high-performance processors
to enhance matrix multiplication performance. However, constrained to its over-specification …
Machine learning based auto-tuning for enhanced OpenCL performance portability
TL Falch, AC Elster - 2015 IEEE International Parallel and …, 2015 - ieeexplore.ieee.org
Heterogeneous computing, which combines devices with different architectures, is rising in
popularity, and promises increased performance combined with reduced energy …
Warp-consolidation: A novel execution model for GPUs
With the unprecedented development of compute capability and extension of memory
bandwidth on modern GPUs, parallel communication and synchronization soon becomes a …
Machine learning‐based auto‐tuning for enhanced performance portability of OpenCL applications
TL Falch, AC Elster - Concurrency and Computation: Practice …, 2017 - Wiley Online Library
Heterogeneous computing, combining devices with different architectures such as CPUs
and GPUs, is rising in popularity and promises increased performance combined with …
GPU-unicache: Automatic code generation of spatial blocking for stencils on GPUs
Spatial blocking is a critical memory-access optimization to efficiently exploit the computing
resources of parallel processors, such as many-core GPUs. By reusing cache-loaded data …
LoRAStencil: Low-Rank Adaptation of Stencil Computation on Tensor Cores
Stencil computations play a pivotal role in numerous scientific and industrial applications,
yet their efficient execution on specialized hardware accelerators like Tensor Core Units …
Moirae: Generating High-Performance Composite Stencil Programs with Global Optimizations
X Liu, X Yang, K Ma, S Liu, K Zhang… - … Conference for High …, 2024 - ieeexplore.ieee.org
Stencil computation is one of the most universal computation motifs in scientific applications
such as weather prediction. Due to the complexity of scientific simulation, the stencil …
Batched triangular dense linear algebra kernels for very small matrix sizes on GPUs
Batched dense linear algebra kernels are becoming ubiquitous in scientific applications,
ranging from tensor contractions in deep learning to data compression in hierarchical low …
Memory access optimization of high-order CFD stencil computations on GPU
S Wang, Z Li, Y Che - … , PDCAT 2020, Shenzhen, China, December 28–30 …, 2021 - Springer
Stencils computations are a class of computations commonly found in scientific and
engineering applications. They have relatively lower arithmetic intensity. Therefore, their …