DFX: A low-latency multi-FPGA appliance for accelerating transformer-based text generation

S Hong, S Moon, J Kim, S Lee, M Kim… - 2022 55th IEEE/ACM …, 2022 - ieeexplore.ieee.org
Transformer is a deep learning language model widely used for natural language
processing (NLP) services in datacenters. Among transformer models, Generative …

CXL-ANNS: Software-Hardware collaborative memory disaggregation and computation for Billion-Scale approximate nearest neighbor search

J Jang, H Choi, H Bae, S Lee, M Kwon… - 2023 USENIX Annual …, 2023 - usenix.org
We propose CXL-ANNS, a software-hardware collaborative approach to enable highly
scalable approximate nearest neighbor search (ANNS) services. To this end, we first …

MTIA: First generation silicon targeting Meta's recommendation systems

A Firoozshahian, J Coburn, R Levenstein… - Proceedings of the 50th …, 2023 - dl.acm.org
Meta has traditionally relied on using CPU-based servers for running inference workloads,
specifically Deep Learning Recommendation Models (DLRM), but the increasing compute …

MAGMA: An optimization framework for mapping multiple DNNs on multiple accelerator cores

SC Kao, T Krishna - 2022 IEEE International Symposium on …, 2022 - ieeexplore.ieee.org
As Deep Learning continues to drive a variety of applications in edge and cloud data
centers, there is a growing trend towards building large accelerators with several sub …

Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology

B Hyun, T Kim, D Lee, M Rhu - 2024 IEEE International …, 2024 - ieeexplore.ieee.org
Processing-in-memory (PIM) has been explored for decades by computer architects, yet it
has never seen the light of day in real-world products due to its high design overheads and …

Hercules: Heterogeneity-aware inference serving for at-scale personalized recommendation

L Ke, U Gupta, M Hempstead, CJ Wu… - … Symposium on High …, 2022 - ieeexplore.ieee.org
Personalized recommendation is an important class of deep-learning applications that
powers a large collection of internet services and consumes a considerable amount of …

Scalability Limitations of Processing-in-Memory using Real System Evaluations

G Jonatan, H Cho, H Son, X Wu, N Livesay… - Proceedings of the …, 2024 - dl.acm.org
Processing-in-memory (PIM), where the compute is moved closer to the memory or the data,
has been widely explored to accelerate emerging workloads. Recently, different PIM-based …

Accelerating ML recommendation with over a thousand RISC-V/tensor processors on Esperanto's ET-SoC-1 chip

D Ditzel, R Espasa, N Aymerich, A Baum… - 2021 IEEE Hot Chips …, 2021 - ieeexplore.ieee.org
The ET-SoC-1 has over a thousand RISC-V processors on a single TSMC 7nm chip,
including: • 1088 energy-efficient ET-Minion 64-bit RISC-V in-order cores each with a …

Special session: Towards an agile design methodology for efficient, reliable, and secure ML systems

S Dave, A Marchisio, MA Hanif… - 2022 IEEE 40th VLSI …, 2022 - ieeexplore.ieee.org
The real-world use cases of Machine Learning (ML) have exploded over the past few years.
However, the current computing infrastructure is insufficient to support all real-world …

The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview

AA Khan, JPC De Lima, H Farzaneh… - arXiv preprint arXiv …, 2024 - arxiv.org
In today's data-centric world, where data fuels numerous application domains, with machine
learning at the forefront, handling the enormous volume of data efficiently in terms of time …