A Review on Edge Large Language Models: Design, Execution, and Applications

Y Zheng, Y Chen, B Qian, X Shi, Y Shu… - ACM Computing …, 2024 - dl.acm.org
Large language models (LLMs) have revolutionized natural language processing with their
exceptional understanding, synthesizing, and reasoning capabilities. However, deploying …

LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism

B Wu, S Liu, Y Zhong, P Sun, X Liu, X Jin - Proceedings of the ACM …, 2024 - dl.acm.org
The context window of large language models (LLMs) is rapidly increasing, leading to a
huge variance in resource usage between different requests as well as between different …

Fast distributed inference serving for large language models

B Wu, Y Zhong, Z Zhang, S Liu, F Liu, Y Sun… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low latency …

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable

C Lin, Z Han, C Zhang, Y Yang, F Yang… - … USENIX Symposium on …, 2024 - usenix.org
The rise of large language models (LLMs) has enabled LLM-based applications (aka AI
agents or co-pilots), a new software paradigm that combines the strength of LLM and …

Vidur: A large-scale simulation framework for LLM inference

A Agrawal, N Kedia, J Mohan… - Proceedings of …, 2024 - proceedings.mlsys.org
Large language models (LLMs) are widely used in various domains for their ability to
perform tasks that require human-like skills. However, LLM inference is expensive today …

RAGCache: Efficient knowledge caching for retrieval-augmented generation

C Jin, Z Zhang, X Jiang, F Liu, X Liu, X Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Retrieval-Augmented Generation (RAG) has shown significant improvements in various
natural language processing tasks by integrating the strengths of large language models …

Mooncake: A KVCache-centric disaggregated architecture for LLM serving

R Qin, Z Li, W He, M Zhang, Y Wu, W Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It
features a KVCache-centric disaggregated architecture that separates the prefill and …

Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU

D Xu, H Zhang, L Yang, R Liu, G Huang, M Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
On-device large language models (LLMs) are catalyzing novel mobile applications such as
UI task automation and personalized email auto-reply, without giving away users' private …