A Survey of Design and Optimization for Systolic Array-based DNN Accelerators
In recent years, it has been witnessed that the systolic array is a successful architecture for
DNN hardware accelerators. However, the design of systolic arrays also encountered many …
AutoSA: A polyhedral compiler for high-performance systolic arrays on FPGA
While systolic array architectures have the potential to deliver tremendous performance, it is
notoriously challenging to customize an efficient systolic array processor for a target …
CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture
Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning
applications. To cope with the high computation demands of these applications …
Transformations of high-level synthesis codes for high-performance computing
Spatial computing architectures promise a major stride in performance and energy efficiency
over the traditional load/store devices currently employed in large scale computing systems …
Sextans: A streaming accelerator for general-purpose sparse-matrix dense-matrix multiplication
Sparse-Matrix Dense-Matrix multiplication (SpMM) is the key operator for a wide range of
applications including scientific computing, graph processing, and deep learning …
Co-design hardware and algorithm for vector search
Vector search has emerged as the foundation for large-scale information retrieval and
machine learning systems, with search engines like Google and Bing processing tens of …
Extending high-level synthesis for task-parallel programs
C/C++/OpenCL-based high-level synthesis (HLS) becomes more and more popular for field-
programmable gate array (FPGA) accelerators in many application domains in recent years …
SuSy: A programming model for productive construction of high-performance systolic arrays on FPGAs
Systolic algorithms are one of the killer applications on spatial architectures such as FPGAs
and CGRAs. However, it requires a tremendous amount of human effort to design and …
Combining dynamic & static scheduling in high-level synthesis
A central task in high-level synthesis is scheduling: the allocation of operations to clock
cycles. The classic approach to scheduling is static, in which each operation is mapped to a …
FleetRec: Large-scale recommendation inference on hybrid GPU-FPGA clusters
We present FleetRec, a high-performance and scalable recommendation inference system
within tight latency constraints. FleetRec takes advantage of heterogeneous hardware …