LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism
The context window of large language models (LLMs) is rapidly increasing, leading to a
huge variance in resource usage between different requests as well as between different …
Rethinking cloud abstractions for tenant-provider cooperative optimization of AI workloads
AI workloads, often hosted in multi-tenant cloud environments, require vast computational
resources but suffer inefficiencies due to limited tenant-provider coordination. Tenants lack …
SYMPHONY: Improving Memory Management for LLM Inference Workloads
Large Language Models (LLMs) are increasingly being deployed in applications such as
chatbots, code editors, and conversational agents. A key feature of LLMs is their ability to …
Towards Efficient Large Multimodal Model Serving
Recent advances in generative AI have led to large multi-modal models (LMMs) capable of
simultaneously processing inputs of various modalities such as text, images, video, and …
TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications
Large Language Models (LLMs) such as GPT-4 and Llama3 can already comprehend
complex commands and process diverse tasks. This advancement facilitates their …
Revisiting SLO and Goodput Metrics in LLM Serving
Z Wang, S Li, Y Zhou, X Li, R Gu, N Cam-Tu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have achieved remarkable performance and are widely
deployed in various applications, while the serving of LLM inference has raised concerns …