Advancing DSP into HPC, AI, and beyond: challenges, mechanisms, and future directions
Y Wang, C Li, C Liu, S Liu, Y Lei, J Zhang… - CCF Transactions on …, 2021 - Springer
Abstract Digital Signal Processors (DSPs) have been widely used in embedded domains,
delivering high performance with ultra-low power consumption. Such promises make it …
Stash: Have your scratchpad and cache it too
Heterogeneous systems employ specialization for energy efficiency. Since data movement
is expected to be a dominant consumer of energy, these systems employ specialized …
ApproxHPVM: a portable compiler IR for accuracy-aware optimizations
We propose ApproxHPVM, a compiler IR and system designed to enable accuracy-aware
performance and energy tuning on heterogeneous systems with multiple compute units and …
Toward cache-friendly hardware accelerators
Increasing demand for power-efficient, high-performance computing has spurred a growing
number and diversity of hardware accelerators in mobile Systems on Chip (SoCs) as well as …
A novel DSP architecture for scientific computing and deep learning
C Yang, S Chen, J Zhang, Z Lv, Z Wang - IEEE Access, 2019 - ieeexplore.ieee.org
Exascale computing requires accelerators with ultrahigh power efficiency. Digital signal
processors (DSPs), the most important embedded processors widely known for high power …
WASP: Exploiting GPU Pipeline Parallelism with Hardware-Accelerated Automatic Warp Specialization
Graphics processing units (GPUs) are an important class of parallel processors that offer
high compute throughput and memory bandwidth. GPUs are used in a variety of important …
Coordinated DMA: improving the DRAM access efficiency for matrix multiplication
S Ma, Z Liu, S Chen, L Huang, Y Guo… - … on Parallel and …, 2019 - ieeexplore.ieee.org
High performance implementation of matrix multiplication is essential for scientific
computing. The memory access procedure is quite possible to be the bottleneck of matrix …
An efficient direct memory access (DMA) controller for scientific computing accelerators
S Ma, L Huang, Y Lei, Y Guo… - 2019 IEEE International …, 2019 - ieeexplore.ieee.org
We design an efficient DMA controller for scientific computing accelerators. It supports
several flexible and powerful transfers, including reshape transfers, parameter linking …
ELF: Maximizing memory-level parallelism for GPUs with coordinated warp and fetch scheduling
Graphics processing units (GPUs) are increasingly utilized as throughput engines in the
modern computer systems. GPUs rely on fast context switching between thousands of …
CIAO: Cache interference-aware throughput-oriented architecture and scheduling for GPUs
A modern GPU aims to simultaneously execute more warps for higher Thread-Level
Parallelism (TLP) and performance. When generating many memory requests, however …