A comprehensive survey of compression algorithms for language models

S Park, J Choi, S Lee, U Kang - arXiv preprint arXiv:2401.15347, 2024 - arxiv.org
How can we compress language models without sacrificing accuracy? The number of
compression algorithms for language models is rapidly growing to benefit from remarkable …

Understanding the potential of fpga-based spatial acceleration for large language model inference

H Chen, J Zhang, Y Du, S Xiang, Z Yue… - ACM Transactions on …, 2024 - dl.acm.org
Recent advancements in large language models (LLMs) boasting billions of parameters
have generated a significant demand for efficient deployment in inference workloads. While …

SPEED: Speculative pipelined execution for efficient decoding

C Hooper, S Kim, H Mohammadzadeh, H Genc… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative Large Language Models (LLMs) based on the Transformer architecture have
recently emerged as a dominant foundation model for a wide range of Natural Language …

ChunkAttention: Efficient self-attention with prefix-aware KV cache and two-phase partition

L Ye, Z Tao, Y Huang, Y Li - arXiv preprint arXiv:2402.15220, 2024 - arxiv.org
Self-attention is an essential component of large language models (LLMs) but a significant
source of inference latency for long sequences. In multi-tenant LLM serving scenarios, the …

MagR: Weight magnitude reduction for enhancing post-training quantization

A Zhang, N Wang, Y Deng, X Li, Z Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present a simple optimization-based preprocessing technique called
Weight Magnitude Reduction (MagR) to improve the performance of post-training …

SOFA: A compute-memory optimized sparsity accelerator via cross-stage coordinated tiling

H Wang, J Fang, X Tang, Z Yue, J Li… - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Benefiting from the self-attention mechanism, Transformer models have attained impressive
contextual comprehension capabilities for lengthy texts. The requirements of high …

ZeD: A generalized accelerator for variably sparse matrix computations in ML

P Dangi, Z Bai, R Juneja, D Wijerathne… - Proceedings of the 2024 …, 2024 - dl.acm.org
Modern Machine Learning (ML) models employ sparsity to mitigate storage and computation
costs, but this gives rise to irregular and unstructured sparse matrix operations that dominate …

Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models

Y Chen, T Tang, E Xiang, L Li, WX Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the real world, large language models (LLMs) can serve as assistants that help users
accomplish their jobs and also support the development of advanced applications. For the …

Conformer-based speech recognition on extreme edge-computing devices

M Xu, A Jin, S Wang, M Su, T Ng, H Mason… - arXiv preprint arXiv …, 2023 - arxiv.org
With increasingly powerful compute capabilities and resources in today's devices,
traditionally compute-intensive automatic speech recognition (ASR) has been moving from …

Foundations of Large Language Models

T Xiao, J Zhu - arXiv preprint arXiv:2501.09223, 2025 - arxiv.org
This is a book about large language models. As indicated by the title, it primarily focuses on
foundational concepts rather than comprehensive coverage of all cutting-edge technologies …