Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve

A Agrawal, N Kedia, A Panwar, J Mohan… - … USENIX Symposium on …, 2024 - usenix.org
Each LLM serving request goes through two phases. The first is prefill, which processes the
entire input prompt and produces the first output token, and the second is decode, which …
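
As a rough illustration of the two phases described in this snippet, the sketch below runs one prefill pass over the whole prompt and then token-by-token decode steps against a toy stand-in model; the ToyModel class and its forward()/KV-cache interface are illustrative assumptions, not code from the paper.

```python
import numpy as np

# Toy stand-in for a decoder-only LM: returns random logits and an appended
# "KV cache" list. Purely illustrative; not the paper's serving system.
class ToyModel:
    def __init__(self, vocab_size=100):
        self.vocab_size = vocab_size

    def forward(self, tokens, kv_cache=None):
        kv_cache = (kv_cache or []) + list(tokens)  # pretend cache of past tokens
        logits = np.random.rand(len(tokens), self.vocab_size)
        return logits, kv_cache

def generate(model, prompt_tokens, max_new_tokens):
    # Prefill: one forward pass over the entire prompt fills the KV cache
    # and yields the first output token.
    logits, kv_cache = model.forward(prompt_tokens, kv_cache=None)
    next_token = int(logits[-1].argmax())
    output = [next_token]

    # Decode: remaining tokens are generated one at a time, each step
    # reusing the cache built during prefill.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model.forward([next_token], kv_cache=kv_cache)
        next_token = int(logits[-1].argmax())
        output.append(next_token)
    return output

print(generate(ToyModel(), prompt_tokens=[1, 2, 3], max_new_tokens=5))
```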

Compute trends across three eras of machine learning

J Sevilla, L Heim, A Ho, T Besiroglu… - … Joint Conference on …, 2022 - ieeexplore.ieee.org
Compute, data, and algorithmic advances are the three fundamental factors that drive
progress in modern Machine Learning (ML). In this paper we study trends in the most readily …

A survey of resource-efficient LLM and multimodal foundation models

M Xu, W Yin, D Cai, R Yi, D Xu, Q Wang, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine …

Spotserve: Serving generative large language models on preemptible instances

X Miao, C Shi, J Duan, X Xi, D Lin, B Cui… - Proceedings of the 29th …, 2024 - dl.acm.org
The high computational and memory requirements of generative large language models
(LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary …

Resource-efficient algorithms and systems of foundation models: A survey

M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2025 - dl.acm.org
Large foundation models, including large language models, vision transformers, diffusion,
and large language model based multimodal models, are revolutionizing the entire machine …

Characterization of large language model development in the datacenter

Q Hu, Z Ye, Z Wang, G Wang, M Zhang… - … USENIX Symposium on …, 2024 - usenix.org
Large Language Models (LLMs) have presented impressive performance across several
transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster …

Sarathi: Efficient LLM inference by piggybacking decodes with chunked prefills

A Agrawal, A Panwar, J Mohan, N Kwatra… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Model (LLM) inference consists of two distinct phases: a prefill phase, which
processes the input prompt, and a decode phase, which generates output tokens …
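
The title's "chunked prefills" idea can be illustrated with a small scheduling sketch: a long prompt's prefill is split across several batches, and each batch's leftover token budget carries ("piggybacks") one decode token per ongoing request. The token budget, request counts, and build_batches helper below are hypothetical, not Sarathi's actual scheduler.

```python
# Hypothetical chunked-prefill scheduling sketch (not Sarathi's implementation).
def build_batches(prompt_len, num_decode_requests, token_budget=512):
    """Split one request's prefill across batches, piggybacking decode tokens."""
    assert num_decode_requests < token_budget
    batches = []
    remaining_prefill = prompt_len
    while remaining_prefill > 0:
        # Reserve one token slot per ongoing decode request, then fill the
        # rest of the per-batch token budget with the next prefill chunk.
        decode_tokens = num_decode_requests
        chunk = min(remaining_prefill, token_budget - decode_tokens)
        remaining_prefill -= chunk
        batches.append({"prefill_tokens": chunk, "decode_tokens": decode_tokens})
    return batches

# Example: a 1200-token prompt with a 512-token budget and 8 ongoing decodes
# is served as three mixed batches instead of one monolithic prefill batch.
for batch in build_batches(prompt_len=1200, num_decode_requests=8):
    print(batch)
```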

Decentralized training of foundation models in heterogeneous environments

B Yuan, Y He, J Davis, T Zhang… - Advances in …, 2022 - proceedings.neurips.cc
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often
involving tens of thousands of GPUs running continuously for months. These models are …

Orion: Interference-aware, fine-grained GPU sharing for ML applications

F Strati, X Ma, A Klimovic - … of the Nineteenth European Conference on …, 2024 - dl.acm.org
GPUs are critical for maximizing the throughput-per-Watt of deep neural network (DNN)
applications. However, DNN applications often underutilize GPUs, even when using large …

Oobleck: Resilient distributed training of large models using pipeline templates

I Jang, Z Yang, Z Zhang, X Jin… - Proceedings of the 29th …, 2023 - dl.acm.org
Oobleck enables resilient distributed training of large DNN models with guaranteed fault
tolerance. It takes a planning-execution co-design approach, where it first generates a set of …