A comprehensive survey of compression algorithms for language models

S Park, J Choi, S Lee, U Kang - arXiv preprint arXiv:2401.15347, 2024 - arxiv.org
How can we compress language models without sacrificing accuracy? The number of
compression algorithms for language models is rapidly growing to benefit from remarkable …

Understanding the potential of fpga-based spatial acceleration for large language model inference

H Chen, J Zhang, Y Du, S Xiang, Z Yue… - ACM Transactions on …, 2024 - dl.acm.org
Recent advancements in large language models (LLMs) boasting billions of parameters
have generated a significant demand for efficient deployment in inference workloads. While …

SPEED: Speculative pipelined execution for efficient decoding

C Hooper, S Kim, H Mohammadzadeh, H Genc… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative Large Language Models (LLMs) based on the Transformer architecture have
recently emerged as a dominant foundation model for a wide range of Natural Language …

ChunkAttention: Efficient self-attention with prefix-aware KV cache and two-phase partition

L Ye, Z Tao, Y Huang, Y Li - arXiv preprint arXiv:2402.15220, 2024 - arxiv.org
Self-attention is an essential component of large language models (LLMs) but a significant
source of inference latency for long sequences. In multi-tenant LLM serving scenarios, the …

MagR: Weight magnitude reduction for enhancing post-training quantization

A Zhang, N Wang, Y Deng, X Li, Z Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present a simple optimization-based preprocessing technique called
Weight Magnitude Reduction (MagR) to improve the performance of post-training …

SOFA: A compute-memory optimized sparsity accelerator via cross-stage coordinated tiling

H Wang, J Fang, X Tang, Z Yue, J Li… - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Benefiting from the self-attention mechanism, Transformer models have attained impressive
contextual comprehension capabilities for lengthy texts. The requirements of high …

ZeD: A generalized accelerator for variably sparse matrix computations in ML

P Dangi, Z Bai, R Juneja, D Wijerathne… - Proceedings of the 2024 …, 2024 - dl.acm.org
Modern Machine Learning (ML) models employ sparsity to mitigate storage and computation
costs, but this gives rise to irregular and unstructured sparse matrix operations that dominate …

Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models

Y Chen, T Tang, E Xiang, L Li, WX Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the real world, large language models (LLMs) can serve as assistants that help users
accomplish their jobs and also support the development of advanced applications. For the …

Conformer-based speech recognition on extreme edge-computing devices

M Xu, A Jin, S Wang, M Su, T Ng, H Mason… - arXiv preprint arXiv …, 2023 - arxiv.org
With increasingly powerful compute capabilities and resources in today's devices,
traditionally compute-intensive automatic speech recognition (ASR) has been moving from …

Foundations of Large Language Models

T Xiao, J Zhu - arXiv preprint arXiv:2501.09223, 2025 - arxiv.org
This is a book about large language models. As indicated by the title, it primarily focuses on
foundational concepts rather than comprehensive coverage of all cutting-edge technologies …