Cnvlutin: Ineffectual-neuron-free deep neural network computing
This work observes that a large fraction of the computations performed by Deep Neural
Networks (DNNs) are intrinsically ineffectual as they involve a multiplication where one of …
MIMD Programs Execution Support on SIMD Machines: A Holistic Survey
D Mustafa, R Alkhasawneh, F Obeidat… - IEEE Access, 2024 - ieeexplore.ieee.org
The Single Instruction Multiple Data (SIMD) architecture, supported by various high-
performance computing platforms, efficiently utilizes data-level parallelism. The SIMD model …
Scaling the power wall: a path to exascale
Modern scientific discovery is driven by an insatiable demand for computing performance.
The HPC community is targeting development of supercomputers able to sustain 1 ExaFlops …
Warped-compression: Enabling power efficient GPUs through register compression
This paper presents Warped-Compression, a warp-level register compression scheme for
reducing GPU power consumption. This work is motivated by the observation that the …
Flexible software profiling of GPU architectures
M Stephenson, SK Sastry Hari, Y Lee… - Proceedings of the …, 2015 - dl.acm.org
To aid application characterization and architecture design space exploration, researchers
and engineers have developed a wide range of tools for CPUs, including simulators …
CUDAAdvisor: LLVM-based runtime profiling for modern GPUs
General-purpose GPUs have been widely utilized to accelerate parallel applications. Given
a relatively complex programming model and fast architecture evolution, producing efficient …
Partial control-flow linearization
If-conversion is a fundamental technique for vectorization. It accounts for the fact that in a
SIMD program, several targets of a branch might be executed because of divergence …
A sparse probabilistic learning algorithm for real-time tracking
We address the problem of applying powerful pattern recognition algorithms based on
kernels to efficient visual tracking. Recently S. Avidan (2001) has shown that object …
SPRING: A sparsity-aware reduced-precision monolithic 3D CNN accelerator architecture for training and inference
Convolutional neural networks (CNNs) outperform traditional machine learning algorithms
across a wide range of applications, such as object recognition, image segmentation, and …
R2D2: Removing ReDunDancy Utilizing Linearity of Address Generation in GPUs
A generally used GPU programming methodology is that adjacent threads access data in
neighbor or specific-stride memory addresses and perform computations with the fetched …