GraphLily: Accelerating graph linear algebra on HBM-equipped FPGAs
Graph processing is typically memory bound due to a low compute-to-memory-access ratio
and irregular data access patterns. The emerging high-bandwidth memory (HBM) delivers …
High-performance sparse linear algebra on HBM-equipped FPGAs using HLS: A case study on SpMV
Sparse linear algebra operators are memory bound due to a low compute-to-memory-access
ratio and irregular data access patterns. The exceptional bandwidth improvement provided …
Sparse-TPU: Adapting systolic arrays for sparse matrices
While systolic arrays are widely used for dense-matrix operations, they are seldom used for
sparse-matrix operations. In this paper, we show how a systolic array of Multiply-and …
SparseAdapt: Runtime control for sparse linear algebra on a reconfigurable accelerator
Dynamic adaptation is a post-silicon optimization technique that adapts the hardware to
workload phases. However, current adaptive approaches are oblivious to implicit phases …
Occamy: A 432-Core Dual-Chiplet Dual-HBM2E 768-DP-GFLOP/s RISC-V System for 8-to-64-bit Dense and Sparse Computing in 12-nm FinFET
Machine learning (ML) and high-performance computing (HPC) applications increasingly
combine dense and sparse memory access computations to maximize storage efficiency …
Transmuter: Bridging the efficiency gap using memory and dataflow reconfiguration
With the end of Dennard scaling and Moore's law, it is becoming increasingly difficult to build
hardware for emerging applications that meet power and performance targets, while …
Versa: A 36-core systolic multiprocessor with dynamically reconfigurable interconnect and memory
We present Versa, an energy-efficient 36-core systolic multiprocessor with dynamically
reconfigurable interconnects and memory. Versa leverages reconfigurable functional units …
A 108-nW 0.8-mm² Analog Voice Activity Detector Featuring a Time-Domain CNN With Sparsity-Aware Computation and Sparsified Quantization in 28-nm CMOS
This article reports a passive analog feature extractor for realizing an area-and-power-
efficient voice activity detector (VAD) for voice-control edge devices. It features a switched …
OnSRAM: Efficient inter-node on-chip scratchpad management in deep learning accelerators
Hardware acceleration of Artificial Intelligence (AI) workloads has gained widespread
popularity with its potential to deliver unprecedented performance and efficiency. An …
DAP: A 507-GMACs/J 256-Core Domain Adaptive Processor for Wireless Communication and Linear Algebra Kernels in 12-nm FinFET
We present domain adaptive processor (DAP), a programmable systolic-array processor
designed for wireless communication and linear algebra workloads. DAP uses a globally …