Kimi k1.5: Scaling reinforcement learning with LLMs

K Team, A Du, B Gao, B Xing, C Jiang, C Chen… - arXiv preprint arXiv …, 2025 - arxiv.org
Language model pretraining with next token prediction has proved effective for scaling
compute but is limited by the amount of available training data. Scaling reinforcement …

InstInfer: In-storage attention offloading for cost-effective long-context LLM inference

X Pan, E Li, Q Li, S Liang, Y Shan, K Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The widespread adoption of Large Language Models (LLMs) marks a significant milestone in
generative AI. Nevertheless, the increasing context length and batch size in offline LLM …

Preble: Efficient distributed prompt scheduling for LLM serving

V Srivatsa, Z He, R Abhyankar, D Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Prompts to large language models (LLMs) have evolved beyond simple user questions. For
LLMs to solve complex problems, today's practices are to include domain-specific …

BatchLLM: Optimizing large batched LLM inference with global prefix sharing and throughput-oriented token batching

Z Zheng, X Ji, T Fang, F Zhou, C Liu, G Peng - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) increasingly play an important role in a wide range of
information processing and management tasks. Many of these tasks are performed in large …

ExpertFlow: Optimized expert activation and token allocation for efficient mixture-of-experts inference

X He, S Zhang, Y Wang, H Yin, Z Zeng, S Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse Mixture of Experts (MoE) models, while outperforming dense Large Language
Models (LLMs) in terms of performance, face significant deployment challenges during …

LayerKV: Optimizing large language model serving with layer-wise KV cache management

Y Xiong, H Wu, C Shao, Z Wang, R Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The expanding context windows in large language models (LLMs) have greatly enhanced
their capabilities in various applications, but they also introduce significant challenges in …

Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries

J Kim, J Park, J Cho, D Papailiopoulos - arXiv preprint arXiv:2412.08890, 2024 - arxiv.org
We introduce Lexico, a novel KV cache compression method that leverages sparse coding
with a universal dictionary. Our key finding is that the key-value cache in modern LLMs can be …

Context Parallelism for Scalable Million-Token Inference

A Yang, J Yang, A Ibrahim, X Xie, B Tang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present context parallelism for long-context large language model inference, which
achieves near-linear scaling for long-context prefill latency with up to 128 H100 GPUs …

Tackling the dynamicity in a production LLM serving system with SOTA optimizations via hybrid prefill/decode/verify scheduling on efficient meta-kernels

M Song, X Tang, F Hou, J Li, W Wei, Y Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Meeting growing demands for low latency and cost efficiency in production-grade large
language model (LLM) serving systems requires integrating advanced optimization …

LLM Knowledge-Driven Target Prototype Learning for Few-Shot Segmentation

P Li, F Liu, L Jiao, S Li, X Liu, P Chen, L Li… - Knowledge-Based …, 2025 - Elsevier
Few-Shot Segmentation (FSS) aims to segment new class objects in a query image
with few support images. The prototype-based FSS methods first model a target prototype …