SGLang: Efficient execution of structured language model programs
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …
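As a concrete illustration of the style of program this paper targets (multiple generation calls, control flow, and constrained choices in one function), here is a minimal sketch in SGLang's frontend DSL, following the patterns in the project's README; the endpoint, question, and variable names are placeholders, and the snippet assumes a locally launched SGLang server.

    import sglang as sgl

    @sgl.function
    def triage(s, question):
        # One program mixing multiple generation calls, control flow,
        # and a constrained (structured) choice.
        s += "Question: " + question + "\n"
        s += "Is this a math question? " + sgl.select("is_math", choices=["yes", "no"]) + "\n"
        if s["is_math"] == "yes":
            s += "Reasoning: " + sgl.gen("steps", max_tokens=128) + "\n"
        s += "Final answer: " + sgl.gen("answer", max_tokens=32, stop="\n")

    # Assumes a server started with, e.g.:
    #   python -m sglang.launch_server --model-path <model> --port 30000
    sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
    state = triage.run(question="What is 17 * 24?")
    print(state["answer"])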
NanoFlow: Towards optimal large language model serving throughput
The increasing usage of Large Language Models (LLMs) has resulted in a surging demand
for planet-scale serving systems, where tens of thousands of GPUs continuously serve …
Mnemosyne: Parallelization strategies for efficiently serving multi-million context length LLM inference requests without approximations
As large language models (LLMs) evolve to handle increasingly longer contexts, serving
inference requests for context lengths in the range of millions of tokens presents unique …
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Each LLM serving request goes through two phases. The first is prefill, which processes the
entire input prompt and produces the first output token; the second is decode, which …
POD-Attention: Unlocking full prefill-decode overlap for faster LLM inference
Each request in LLM inference goes through two phases: compute-bound prefill and
memory-bandwidth-bound decode. To improve GPU utilization, recent systems use hybrid …
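The two entries above build on the same two-phase structure of LLM inference. The toy sketch below is purely illustrative (the "forward passes" are arithmetic stand-ins, not a real model, and it shows neither system's actual scheduler): prefill attends over the whole prompt at once and is compute-bound, while decode emits one token per step against a growing KV cache and is memory-bandwidth-bound.

    def prefill(prompt_tokens):
        # Compute-bound: one pass over the entire prompt builds the KV cache
        # and yields the first output token.
        kv_cache = list(prompt_tokens)            # stand-in for cached keys/values
        first_token = sum(prompt_tokens) % 50257  # stand-in for a forward pass
        return kv_cache, first_token

    def decode(kv_cache, token, max_new_tokens=8):
        # Memory-bandwidth-bound: one token per step; each step reads the
        # whole KV cache but does comparatively little compute.
        output = [token]
        for _ in range(max_new_tokens - 1):
            kv_cache.append(token)
            token = (token * 31 + len(kv_cache)) % 50257  # stand-in forward pass
            output.append(token)
        return output

    kv, first = prefill([101, 2023, 2003, 1037, 3231])
    print(decode(kv, first))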
AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding
This paper introduces AdaServe, the first LLM serving system to support SLO customization
through fine-grained speculative decoding. AdaServe leverages the logits of a draft model to …
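As background for the entry above: AdaServe builds on speculative decoding, in which a cheap draft model proposes several tokens and the target model verifies them in a single pass. The sketch below shows only the vanilla greedy draft-and-verify loop, not AdaServe's SLO-customized, logit-driven variant; both "models" are hypothetical stand-in functions.

    def speculative_step(draft_model, target_model, prefix, k=4):
        # Draft phase: the cheap model proposes k tokens autoregressively.
        proposal, ctx = [], list(prefix)
        for _ in range(k):
            t = draft_model(ctx)
            proposal.append(t)
            ctx.append(t)
        # Verify phase: the target model checks the proposals (a real system
        # scores all k positions in one batched forward pass).
        accepted, ctx = [], list(prefix)
        for t in proposal:
            expected = target_model(ctx)
            if expected != t:
                accepted.append(expected)  # emit the target's token, stop here
                break
            accepted.append(t)
            ctx.append(t)
        return accepted

    # Hypothetical stand-in next-token functions.
    draft = lambda ctx: (sum(ctx) * 7) % 100
    target = lambda ctx: (sum(ctx) * 7) % 100 if len(ctx) % 3 else sum(ctx) % 100

    print(speculative_step(draft, target, [1, 2, 3]))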
MARM: Unlocking the Future of Recommendation Systems through Memory Augmentation and Scalable Complexity
Scaling laws have guided language model design in past years; however, it is worth
noting that the scaling laws of NLP cannot be directly applied to RecSys due to the following …
ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction
Large Language Models (LLMs) are widely used in today's natural language processing
tasks. To support applications like multi-turn chats, document understanding, and …
DeServe: Towards Affordable Offline LLM Inference via Decentralization
The rapid growth of generative AI and its integration into everyday workflows have
significantly increased the demand for large language model (LLM) inference services …