GraphLily: Accelerating graph linear algebra on HBM-equipped FPGAs

Y Hu, Y Du, E Ustun, Z Zhang - 2021 IEEE/ACM International …, 2021 - ieeexplore.ieee.org
Graph processing is typically memory bound due to low compute to memory access ratio
and irregular data access pattern. The emerging high-bandwidth memory (HBM) delivers …

High-performance sparse linear algebra on HBM-equipped FPGAs using HLS: A case study on SpMV

Y Du, Y Hu, Z Zhou, Z Zhang - Proceedings of the 2022 ACM/SIGDA …, 2022 - dl.acm.org
Sparse linear algebra operators are memory bound due to low compute to memory access
ratio and irregular data access patterns. The exceptional bandwidth improvement provided …

Sparse-TPU: Adapting systolic arrays for sparse matrices

X He, S Pal, A Amarnath, S Feng, DH Park… - Proceedings of the 34th …, 2020 - dl.acm.org
While systolic arrays are widely used for dense-matrix operations, they are seldom used for
sparse-matrix operations. In this paper, we show how a systolic array of Multiply-and …

SparseAdapt: Runtime control for sparse linear algebra on a reconfigurable accelerator

S Pal, A Amarnath, S Feng, M O'Boyle… - MICRO-54: 54th Annual …, 2021 - dl.acm.org
Dynamic adaptation is a post-silicon optimization technique that adapts the hardware to
workload phases. However, current adaptive approaches are oblivious to implicit phases …

Occamy: A 432-Core Dual-Chiplet Dual-HBM2E 768-DP-GFLOP/s RISC-V System for 8-to-64-bit Dense and Sparse Computing in 12-nm FinFET

P Scheffler, T Benz, V Potocnik… - IEEE Journal of Solid …, 2025 - ieeexplore.ieee.org
Machine learning (ML) and high-performance computing (HPC) applications increasingly
combine dense and sparse memory access computations to maximize storage efficiency …

Transmuter: Bridging the efficiency gap using memory and dataflow reconfiguration

S Pal, S Feng, D Park, S Kim, A Amarnath… - Proceedings of the …, 2020 - dl.acm.org
With the end of Dennard scaling and Moore's law, it is becoming increasingly difficult to build
hardware for emerging applications that meet power and performance targets, while …

Versa: A 36-core systolic multiprocessor with dynamically reconfigurable interconnect and memory

S Kim, M Fayazi, A Daftardar, KY Chen… - IEEE Journal of Solid …, 2022 - ieeexplore.ieee.org
We present Versa, an energy-efficient 36-core systolic multiprocessor with dynamically
reconfigurable interconnects and memory. Versa leverages reconfigurable functional units …

A 108-nW 0.8-mm² Analog Voice Activity Detector Featuring a Time-Domain CNN With Sparsity-Aware Computation and Sparsified Quantization in 28-nm CMOS

F Chen, KF Un, WH Yu, PI Mak… - IEEE Journal of Solid …, 2022 - ieeexplore.ieee.org
This article reports a passive analog feature extractor for realizing an area-and-power-
efficient voice activity detector (VAD) for voice-control edge devices. It features a switched …

OnSRAM: Efficient inter-node on-chip scratchpad management in deep learning accelerators

S Pal, S Venkataramani, V Srinivasan… - ACM Transactions on …, 2022 - dl.acm.org
Hardware acceleration of Artificial Intelligence (AI) workloads has gained widespread
popularity with its potential to deliver unprecedented performance and efficiency. An …

DAP: A 507-GMACs/J 256-Core Domain Adaptive Processor for Wireless Communication and Linear Algebra Kernels in 12-nm FINFET

KY Chen, CS Yang, YH Sun, CW Tseng… - IEEE Journal of Solid …, 2024 - ieeexplore.ieee.org
We present domain adaptive processor (DAP), a programmable systolic-array processor
designed for wireless communication and linear algebra workloads. DAP uses a globally …