Adaptable logical control for large language models

H Zhang, PN Kung, M Yoshida… - Advances in …, 2025 - proceedings.neurips.cc
Despite the success of Large Language Models (LLMs) on various tasks following human
instructions, controlling model generation to follow strict constraints at inference time poses …

Optimizing instructions and demonstrations for multi-stage language model programs

K Opsahl-Ong, MJ Ryan, J Purtell, D Broman… - arXiv preprint arXiv …, 2024 - arxiv.org
Language Model Programs, i.e., sophisticated pipelines of modular language model (LM)
calls, are increasingly advancing NLP tasks, but they require crafting prompts that are jointly …

Kimi k1.5: Scaling reinforcement learning with LLMs

K Team, A Du, B Gao, B Xing, C Jiang, C Chen… - arXiv preprint arXiv …, 2025 - arxiv.org
Language model pretraining with next token prediction has proved effective for scaling
compute but is limited by the amount of available training data. Scaling reinforcement …

Stateful large language model serving with Pensieve

L Yu, J Lin, J Li - arXiv preprint arXiv:2312.05516, 2023 - arxiv.org
Large Language Models (LLMs) are wildly popular today and it is important to serve them
efficiently. Existing LLM serving systems are stateless across requests. Consequently, when …

vAttention: Dynamic memory management for serving LLMs without PagedAttention

R Prabhu, A Nayak, J Mohan, R Ramjee… - arXiv preprint arXiv …, 2024 - arxiv.org
Efficient management of GPU memory is essential for high-throughput LLM inference. Prior
systems reserved KV-cache memory ahead of time, which resulted in wasted capacity …

Neo: Saving GPU memory crisis with CPU offloading for online LLM inference

X Jiang, Y Zhou, S Cao, I Stoica, M Yu - arXiv preprint arXiv:2411.01142, 2024 - arxiv.org
Online LLM inference powers many exciting applications such as intelligent chatbots and
autonomous agents. Modern LLM inference engines widely rely on request batching to …

StructuredRAG: JSON response formatting with large language models

C Shorten, C Pierse, TB Smith, E Cardenas… - arXiv preprint arXiv …, 2024 - arxiv.org
The ability of Large Language Models (LLMs) to generate structured outputs, such as JSON,
is crucial for their use in Compound AI Systems. However, evaluating and improving this …

User Behavior Simulation with Large Language Model-based Agents

L Wang, J Zhang, H Yang, ZY Chen, J Tang… - ACM Transactions on …, 2025 - dl.acm.org
Simulating high-quality user behavior data has always been a fundamental yet challenging
problem in human-centered applications such as recommendation systems, social networks …

TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

W Wu, Z Pan, C Wang, L Chen, Y Bai, K Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the development of large language models (LLMs), the ability to handle longer contexts
has become a key capability for Web applications such as cross-document understanding …

UBER: Uncertainty-Based Evolution with Large Language Models for Automatic Heuristic Design

Z Chen, Z Zhou, Y Lu, R Xu, L Pan, Z Lan - arXiv preprint arXiv …, 2024 - arxiv.org
NP-hard problem-solving traditionally relies on heuristics, but manually crafting effective
heuristics for complex problems remains challenging. While recent work like FunSearch has …