Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve
Each LLM serving request goes through two phases. The first is prefill, which processes the
entire input prompt and produces the first output token; the second is decode, which …
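The two-phase structure this abstract describes is a general property of autoregressive inference, and a short sketch makes it concrete. Everything below is a hypothetical illustration: model_forward stands in for a real transformer forward pass and is not Sarathi-Serve's (or any library's) API.

from typing import List, Tuple

# Hypothetical stand-in for a transformer forward pass; NOT a real
# library API. It consumes new tokens plus the existing KV cache and
# returns the next token and the updated cache.
def model_forward(tokens: List[int], kv_cache: List[int]) -> Tuple[int, List[int]]:
    kv_cache = kv_cache + tokens           # stand-in for attention KV state
    next_token = sum(kv_cache) % 50_000    # dummy "sampled" token
    return next_token, kv_cache

def generate(prompt_tokens: List[int], max_new_tokens: int) -> List[int]:
    # Prefill: a single compute-bound pass over the whole prompt that
    # produces the first output token and populates the KV cache.
    first_token, kv_cache = model_forward(prompt_tokens, [])
    output = [first_token]
    # Decode: a memory-bound loop generating one token per forward
    # pass, each step reusing the cache built during prefill.
    for _ in range(max_new_tokens - 1):
        next_token, kv_cache = model_forward([output[-1]], kv_cache)
        output.append(next_token)
    return output

print(generate([101, 202, 303], max_new_tokens=4))

The asymmetry the sketch exposes (one wide prefill pass versus many single-token decode passes) is exactly the throughput-latency tension named in the title.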
Compute trends across three eras of machine learning
Compute, data, and algorithmic advances are the three fundamental factors that drive
progress in modern Machine Learning (ML). In this paper we study trends in the most readily …
A survey of resource-efficient LLM and multimodal foundation models
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine …
SpotServe: Serving generative large language models on preemptible instances
The high computational and memory requirements of generative large language models
(LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary …
Resource-efficient algorithms and systems of foundation models: A survey
Large foundation models, including large language models, vision transformers, diffusion,
and large language model based multimodal models, are revolutionizing the entire machine …
Characterization of large language model development in the datacenter
Large Language Models (LLMs) have demonstrated impressive performance across several
transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster …
Sarathi: Efficient LLM inference by piggybacking decodes with chunked prefills
Large Language Model (LLM) inference consists of two distinct phases: a prefill phase, which
processes the input prompt, and a decode phase, which generates output tokens …
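The chunking idea named in this title can be illustrated with a toy scheduler. The sketch below shows the general technique only, not Sarathi's implementation: the chunk size, token budget, and all names are assumptions made for the example.

from collections import deque

CHUNK_SIZE = 4          # prompt tokens processed per iteration (assumed)
BATCH_TOKEN_BUDGET = 8  # max tokens per hybrid batch (assumed)

def schedule(prompt: list, decode_queue: deque):
    """Yield hybrid batches: one prefill chunk plus piggybacked decodes."""
    for start in range(0, len(prompt), CHUNK_SIZE):
        # Split the prefill into fixed-size chunks instead of running
        # the whole prompt in one pass.
        chunk = prompt[start:start + CHUNK_SIZE]
        # Fill the leftover token budget with one-token decode requests,
        # so decodes ride along with every prefill chunk.
        budget = BATCH_TOKEN_BUDGET - len(chunk)
        decodes = [decode_queue.popleft()
                   for _ in range(min(budget, len(decode_queue)))]
        yield chunk, decodes

prompt = list(range(10))  # a 10-token prompt
decode_queue = deque(["r1", "r2", "r3", "r4", "r5", "r6", "r7"])
for chunk, decodes in schedule(prompt, decode_queue):
    print(f"batch: prefill chunk {chunk} + decodes {decodes}")

Keeping decode tokens in every batch is what lets ongoing requests make steady progress while long prompts are being prefilled.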
Decentralized training of foundation models in heterogeneous environments
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often
involving tens of thousands of GPUs running continuously for months. These models are …
Orion: Interference-aware, fine-grained GPU sharing for ML applications
GPUs are critical for maximizing the throughput-per-Watt of deep neural network (DNN)
applications. However, DNN applications often underutilize GPUs, even when using large …
Oobleck: Resilient distributed training of large models using pipeline templates
Oobleck enables resilient distributed training of large DNN models with guaranteed fault
tolerance. It takes a planning-execution co-design approach, where it first generates a set of …