LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

Fast distributed inference serving for large language models

B Wu, Y Zhong, Z Zhang, S Liu, F Liu, Y Sun… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low latency …

Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU

D Xu, H Zhang, L Yang, R Liu, G Huang, M Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
On-device large language models (LLMs) are catalyzing novel mobile applications such as
UI task automation and personalized email auto-reply, without giving away users' private …

MemServe: Context caching for disaggregated LLM serving with elastic memory pool

C Hu, H Huang, J Hu, J Xu, X Chen, T Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language model (LLM) serving has transformed from stateless to stateful systems,
utilizing techniques like context caching and disaggregated inference. These optimizations …

CacheGen: KV cache compression and streaming for fast large language model serving

Y Liu, H Li, Y Cheng, S Ray, Y Huang… - Proceedings of the …, 2024 - dl.acm.org
As large language models (LLMs) take on complex tasks, their inputs are supplemented with
longer contexts that incorporate domain knowledge. Yet using long contexts is challenging …

Andes: Defining and enhancing quality-of-experience in LLM-based text streaming services

J Liu, JW Chung, Z Wu, F Lai, M Lee… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are now at the core of conversational AI services such as
real-time translation and chatbots, which provide live user interaction by incrementally …

Shortcut-connected expert parallelism for accelerating mixture-of-experts

W Cai, J Jiang, L Qin, J Cui, S Kim, J Huang - arXiv preprint arXiv …, 2024 - arxiv.org
Expert parallelism has been introduced as a strategy to distribute the computational
workload of sparsely-gated mixture-of-experts (MoE) models across multiple computing …

Mnemosyne: Parallelization strategies for efficiently serving multi-million context length LLM inference requests without approximations

A Agrawal, J Chen, Í Goiri, R Ramjee, C Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) evolve to handle increasingly longer contexts, serving
inference requests for context lengths in the range of millions of tokens presents unique …

SOFA: A compute-memory optimized sparsity accelerator via cross-stage coordinated tiling

H Wang, J Fang, X Tang, Z Yue, J Li… - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Benefiting from the self-attention mechanism, Transformer models have attained impressive
contextual comprehension capabilities for lengthy texts. The requirements of high …