FPGA HLS today: successes, challenges, and opportunities

J Cong, J Lau, G Liu, S Neuendorffer, P Pan… - ACM Transactions on …, 2022 - dl.acm.org
The year 2011 marked an important transition for FPGA high-level synthesis (HLS), as it
went from prototyping to deployment. A decade later, in this article, we assess the progress …

Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation

J Ansel, E Yang, H He, N Gimelshein, A Jain… - Proceedings of the 29th …, 2024 - dl.acm.org
This paper introduces two extensions to the popular PyTorch machine learning framework,
TorchDynamo and TorchInductor, which implement the torch.compile feature released in …
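For reference, torch.compile is used as a one-line wrapper around an existing module; a minimal sketch, assuming a PyTorch 2.x installation (the toy model and tensor shapes below are illustrative, not from the paper):

```python
import torch

# A small model to demonstrate torch.compile (TorchDynamo + TorchInductor under the hood).
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
)

# torch.compile captures Python bytecode via TorchDynamo and lowers the captured
# graph through TorchInductor; the returned module is a drop-in replacement.
compiled_model = torch.compile(model)

x = torch.randn(32, 64)
out = compiled_model(x)  # first call triggers compilation; later calls reuse the compiled graph
print(out.shape)
```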

Pathways: Asynchronous distributed dataflow for ml

P Barham, A Chowdhery, J Dean… - Proceedings of …, 2022 - proceedings.mlsys.org
We present the design of a new large scale orchestration layer for accelerators. Our system,
Pathways, is explicitly designed to enable exploration of new systems and ML research …

Tensorir: An abstraction for automatic tensorized program optimization

S Feng, B Hou, H Jin, W Lin, J Shao, R Lai… - Proceedings of the 28th …, 2023 - dl.acm.org
Deploying deep learning models on various devices has become an important topic. The
wave of hardware specialization brings a diverse set of acceleration primitives for multi …

A survey on deep learning hardware accelerators for heterogeneous hpc platforms

C Silvano, D Ielmini, F Ferrandi, L Fiorin… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent trends in deep learning (DL) imposed hardware accelerators as the most viable
solution for several classes of high-performance computing (HPC) applications such as …

Challenges and opportunities to enable large-scale computing via heterogeneous chiplets

Z Yang, S Ji, X Chen, J Zhuang… - 2024 29th Asia and …, 2024 - ieeexplore.ieee.org
Fast-evolving artificial intelligence (AI) algorithms such as large language models have
been driving the ever-increasing computing demands in today's data centers …

Allo: A programming model for composable accelerator design

H Chen, N Zhang, S Xiang, Z Zeng, M Dai… - Proceedings of the ACM …, 2024 - dl.acm.org
Special-purpose hardware accelerators are increasingly pivotal for sustaining performance
improvements in emerging applications, especially as the benefits of technology scaling …

SecretFlow-SPU: A performant and user-friendly framework for privacy-preserving machine learning

J Ma, Y Zheng, J Feng, D Zhao, H Wu, W Fang… - 2023 USENIX Annual …, 2023 - usenix.org
With the increasing public attention to data security and privacy protection, privacy-
preserving machine learning (PPML) has become a research hotspot in recent years …

Apollo: Automatic partition-based operator fusion through layer by layer optimization

J Zhao, X Gao, R Xia, Z Zhang… - Proceedings of …, 2022 - proceedings.mlsys.org
We study fusion for deep neural networks (DNNs) in a just-in-time (JIT) compilation
framework Apollo. It considers both memory- and compute-bound tensor operators for fusion …

AKG: automatic kernel generation for neural processing units using polyhedral transformations

J Zhao, B Li, W Nie, Z Geng, R Zhang, X Gao… - Proceedings of the …, 2021 - dl.acm.org
Existing tensor compilers have proven their effectiveness in deploying deep neural networks
on general-purpose hardware like CPU and GPU, but optimizing for neural processing units …