A Survey of Design and Optimization for Systolic Array-based DNN Accelerators

R Xu, S Ma, Y Guo, D Li - ACM Computing Surveys, 2023 - dl.acm.org
In recent years, it has been witnessed that the systolic array is a successful architecture for
DNN hardware accelerators. However, the design of systolic arrays also encountered many …

AutoSA: A polyhedral compiler for high-performance systolic arrays on FPGA

J Wang, L Guo, J Cong - The 2021 ACM/SIGDA International Symposium …, 2021 - dl.acm.org
While systolic array architectures have the potential to deliver tremendous performance, it is
notoriously challenging to customize an efficient systolic array processor for a target …

CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture

J Zhuang, J Lau, H Ye, Z Yang, Y Du, J Lo… - Proceedings of the …, 2023 - dl.acm.org
Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning
applications. To cope with the high computation demands of these applications …

Transformations of high-level synthesis codes for high-performance computing

J de Fine Licht, M Besta, S Meierhans… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Spatial computing architectures promise a major stride in performance and energy efficiency
over the traditional load/store devices currently employed in large scale computing systems …

Sextans: A streaming accelerator for general-purpose sparse-matrix dense-matrix multiplication

L Song, Y Chi, A Sohrabizadeh, Y Choi, J Lau… - Proceedings of the …, 2022 - dl.acm.org
Sparse-Matrix Dense-Matrix multiplication (SpMM) is the key operator for a wide range of
applications including scientific computing, graph processing, and deep learning …

Co-design hardware and algorithm for vector search

W Jiang, S Li, Y Zhu, J de Fine Licht, Z He… - Proceedings of the …, 2023 - dl.acm.org
Vector search has emerged as the foundation for large-scale information retrieval and
machine learning systems, with search engines like Google and Bing processing tens of …

Extending high-level synthesis for task-parallel programs

Y Chi, L Guo, J Lau, Y Choi, J Wang… - 2021 IEEE 29th Annual …, 2021 - ieeexplore.ieee.org
C/C++/OpenCL-based high-level synthesis (HLS) becomes more and more popular for
field-programmable gate array (FPGA) accelerators in many application domains in recent years …

SuSy: A programming model for productive construction of high-performance systolic arrays on FPGAs

YH Lai, H Rong, S Zheng, W Zhang, X Cui… - Proceedings of the 39th …, 2020 - dl.acm.org
Systolic algorithms are one of the killer applications on spatial architectures such as FPGAs
and CGRAs. However, it requires a tremendous amount of human effort to design and …

Combining dynamic & static scheduling in high-level synthesis

J Cheng, L Josipovic, GA Constantinides… - Proceedings of the …, 2020 - dl.acm.org
A central task in high-level synthesis is scheduling: the allocation of operations to clock
cycles. The classic approach to scheduling is static, in which each operation is mapped to a …

FleetRec: Large-scale recommendation inference on hybrid GPU-FPGA clusters

W Jiang, Z He, S Zhang, K Zeng, L Feng… - Proceedings of the 27th …, 2021 - dl.acm.org
We present FleetRec, a high-performance and scalable recommendation inference system
within tight latency constraints. FleetRec takes advantage of heterogeneous hardware …