Cnvlutin: Ineffectual-neuron-free deep neural network computing

J Albericio, P Judd, T Hetherington, T Aamodt… - ACM SIGARCH …, 2016 - dl.acm.org
This work observes that a large fraction of the computations performed by Deep Neural
Networks (DNNs) are intrinsically ineffectual as they involve a multiplication where one of …

MIMD Programs Execution Support on SIMD Machines: A Holistic Survey

D Mustafa, R Alkhasawneh, F Obeidat… - IEEE Access, 2024 - ieeexplore.ieee.org
The Single Instruction Multiple Data (SIMD) architecture, supported by various
high-performance computing platforms, efficiently utilizes data-level parallelism. The SIMD model …

Scaling the power wall: a path to exascale

O Villa, DR Johnson, M O'Connor… - SC'14: Proceedings …, 2014 - ieeexplore.ieee.org
Modern scientific discovery is driven by an insatiable demand for computing performance.
The HPC community is targeting development of supercomputers able to sustain 1 ExaFlops …

Warped-compression: Enabling power efficient GPUs through register compression

S Lee, K Kim, G Koo, H Jeon, WW Ro… - ACM SIGARCH …, 2015 - dl.acm.org
This paper presents Warped-Compression, a warp-level register compression scheme for
reducing GPU power consumption. This work is motivated by the observation that the …

Flexible software profiling of GPU architectures

M Stephenson, SK Sastry Hari, Y Lee… - Proceedings of the …, 2015 - dl.acm.org
To aid application characterization and architecture design space exploration, researchers
and engineers have developed a wide range of tools for CPUs, including simulators …

CUDAAdvisor: LLVM-based runtime profiling for modern GPUs

D Shen, SL Song, A Li, X Liu - … of the 2018 International Symposium on …, 2018 - dl.acm.org
General-purpose GPUs have been widely utilized to accelerate parallel applications. Given
a relatively complex programming model and fast architecture evolution, producing efficient …

Partial control-flow linearization

S Moll, S Hack - ACM SIGPLAN Notices, 2018 - dl.acm.org
If-conversion is a fundamental technique for vectorization. It accounts for the fact that in a
SIMD program, several targets of a branch might be executed because of divergence …

A sparse probabilistic learning algorithm for real-time tracking

Blake, Cipolla - Proceedings Ninth IEEE International …, 2003 - ieeexplore.ieee.org
We address the problem of applying powerful pattern recognition algorithms based on
kernels to efficient visual tracking. Recently S. Avidan (2001) has shown that object …

SPRING: A sparsity-aware reduced-precision monolithic 3D CNN accelerator architecture for training and inference

Y Yu, NK Jha - IEEE Transactions on Emerging Topics in …, 2020 - ieeexplore.ieee.org
Convolutional neural networks (CNNs) outperform traditional machine learning algorithms
across a wide range of applications, such as object recognition, image segmentation, and …

R2D2: Removing ReDunDancy Utilizing Linearity of Address Generation in GPUs

D Ha, Y Oh, WW Ro - Proceedings of the 50th Annual International …, 2023 - dl.acm.org
A generally used GPU programming methodology is that adjacent threads access data in
neighbor or specific-stride memory addresses and perform computations with the fetched …