Kimi k1.5: Scaling reinforcement learning with LLMs

K Team, A Du, B Gao, B Xing, C Jiang, C Chen… - arXiv preprint arXiv …, 2025 - arxiv.org
Language model pretraining with next token prediction has proved effective for scaling
compute but is limited by the amount of available training data. Scaling reinforcement …

InstInfer: In-storage attention offloading for cost-effective long-context LLM inference

X Pan, E Li, Q Li, S Liang, Y Shan, K Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The widespread adoption of Large Language Models (LLMs) marks a significant milestone in
generative AI. Nevertheless, the increasing context length and batch size in offline LLM …

Preble: Efficient distributed prompt scheduling for LLM serving

V Srivatsa, Z He, R Abhyankar, D Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Prompts to large language models (LLMs) have evolved beyond simple user questions. For
LLMs to solve complex problems, today's practices are to include domain-specific …

BatchLLM: Optimizing large batched LLM inference with global prefix sharing and throughput-oriented token batching

Z Zheng, X Ji, T Fang, F Zhou, C Liu, G Peng - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) increasingly play an important role in a wide range of
information processing and management tasks. Many of these tasks are performed in large …

ExpertFlow: Optimized expert activation and token allocation for efficient mixture-of-experts inference

X He, S Zhang, Y Wang, H Yin, Z Zeng, S Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse Mixture of Experts (MoE) models, while outperforming dense Large Language
Models (LLMs) in terms of performance, face significant deployment challenges during …

LayerKV: Optimizing large language model serving with layer-wise KV cache management

Y Xiong, H Wu, C Shao, Z Wang, R Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The expanding context windows in large language models (LLMs) have greatly enhanced
their capabilities in various applications, but they also introduce significant challenges in …

Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries

J Kim, J Park, J Cho, D Papailiopoulos - arXiv preprint arXiv:2412.08890, 2024 - arxiv.org
We introduce Lexico, a novel KV cache compression method that leverages sparse coding
with a universal dictionary. Our key finding is that the key-value cache in modern LLMs can be …

Context Parallelism for Scalable Million-Token Inference

A Yang, J Yang, A Ibrahim, X Xie, B Tang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present context parallelism for long-context large language model inference, which
achieves near-linear scaling for long-context prefill latency with up to 128 H100 GPUs …

Tackling the dynamicity in a production LLM serving system with SOTA optimizations via hybrid prefill/decode/verify scheduling on efficient meta-kernels

M Song, X Tang, F Hou, J Li, W Wei, Y Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Meeting growing demands for low latency and cost efficiency in production-grade large
language model (LLM) serving systems requires integrating advanced optimization …

LLM Knowledge-Driven Target Prototype Learning for Few-Shot Segmentation

P Li, F Liu, L Jiao, S Li, X Liu, P Chen, L Li… - Knowledge-Based …, 2025 - Elsevier
Few-Shot Segmentation (FSS) aims to segment new class objects in a query image
with few support images. The prototype-based FSS methods first model a target prototype …