EAGLE-2: Faster inference of language models with dynamic draft trees

Y Li, F Wei, C Zhang, H Zhang - arXiv preprint arXiv:2406.16858, 2024 - arxiv.org
Inference with modern Large Language Models (LLMs) is expensive and time-consuming,
and speculative sampling has proven to be an effective solution. Most speculative sampling …

Speculative diffusion decoding: Accelerating language generation through diffusion

JK Christopher, BR Bartoldson, B Kailkhura… - arXiv preprint arXiv …, 2024 - arxiv.org
Speculative decoding has emerged as a widely adopted method to accelerate large
language model inference without sacrificing the quality of the model outputs. While this …

GRIFFIN: Effective Token Alignment for Faster Speculative Decoding

S Hu, J Li, X **e, Z Lu, KC Toh, P Zhou - arxiv preprint arxiv:2502.11018, 2025 - arxiv.org
Speculative decoding accelerates inference in large language models (LLMs) by generating
multiple draft tokens simultaneously. However, existing methods often struggle with token …

Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference

L Zhang, Z Zhang, B Xu, S Mei, D Li - arXiv preprint arXiv:2412.18934, 2024 - arxiv.org
Due to the high resource demands of Large Language Models (LLMs), achieving
widespread deployment on consumer-grade devices presents significant challenges …

C2T: A Classifier-Based Tree Construction Method in Speculative Decoding

F Huo, J Tan, K Zhang, X Cai, S Sun - arXiv preprint arXiv:2502.13652, 2025 - arxiv.org
The growing scale of Large Language Models (LLMs) has exacerbated inference latency
and computational costs. Speculative decoding methods, which aim to mitigate these issues …

WeInfer: Unleashing the Power of WebGPU on LLM Inference in Web Browsers

Z Chen, Y Ma, S Haiyang, M Liu - THE WEB CONFERENCE 2025 - openreview.net
Web-based large language model (LLM) has garnered significant attention from both
academia and industry due to its potential to combine the benefits of on-device computation …

[PDF] Speculative Diffusion Decoding for Accelerated Language Generation

JK Christopher, BR Bartoldson, T Ben-Nun, M Cardei… - neurips2024-enlsp.github.io
Speculative decoding has emerged as a widely adopted method to accelerate large
language model inference without sacrificing the quality of the model outputs. While this …

Polybasic Speculative Decoding Under a Theoretical Perspective

R Wang, H Li, Y Ma, X Zheng, F Chao, X Xiao, R Ji - openreview.net
Speculative decoding has emerged as a critical technique for accelerating inference in large
language models, achieving significant speedups while ensuring consistency with the …

[CITATION][C] Empowering Large Language Models to Edge Intelligence: A Survey of Edge Efficient LLMs and Techniques

R Wang, Z Gao, L Zhang, S Yue, Z Gao