Splitwise: Efficient generative LLM inference using phase splitting

P Patel, E Choukse, C Zhang, A Shah… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
Generative large language model (LLM) applications are growing rapidly, leading to large-
scale deployments of expensive and power-hungry GPUs. Our characterization of LLM …

Sglang: Efficient execution of structured language model programs

L Zheng, L Yin, Z Xie, CL Sun… - Advances in …, 2025 - proceedings.neurips.cc
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …

LLM inference serving: Survey of recent advances and opportunities

B Li, Y Jiang, V Gadepally, D Tiwari - arXiv preprint arXiv:2407.12391, 2024 - arxiv.org
This survey offers a comprehensive overview of recent advancements in Large Language
Model (LLM) serving systems, focusing on research since the year 2023. We specifically …

Large language models meet next-generation networking technologies: A review

CN Hang, PD Yu, R Morabito, CW Tan - Future Internet, 2024 - mdpi.com
The evolution of network technologies has significantly transformed global communication,
information sharing, and connectivity. Traditional networks, relying on static configurations …

Fast distributed inference serving for large language models

B Wu, Y Zhong, Z Zhang, S Liu, F Liu, Y Sun… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) power a new generation of interactive AI applications
exemplified by ChatGPT. The interactive nature of these applications demands low latency …

Efficiently Programming Large Language Models using SGLang.

L Zheng, L Yin, Z Xie, J Huang, C Sun, CH Yu, S Cao… - 2023 - par.nsf.gov
Large language models (LLMs) are increasingly used for complex tasks that require multiple
generation calls, advanced prompting techniques, control flow, and structured …

Andes: Defining and enhancing quality-of-experience in LLM-based text streaming services

J Liu, JW Chung, Z Wu, F Lai, M Lee… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are now at the core of conversational AI services such as
real-time translation and chatbots, which provide live user interaction by incrementally …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

Inference without interference: Disaggregate LLM inference for mixed downstream workloads

C Hu, H Huang, L Xu, X Chen, J Xu, S Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language model (LLM) inference serving is now the backbone of
many cloud services. LLM inference consists of a prefill phase and a decode phase …

Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve

A Agrawal, N Kedia, A Panwar, J Mohan… - arXiv preprint arXiv …, 2024 - arxiv.org
Each LLM serving request goes through two phases. The first is prefill which processes the
entire input prompt to produce one output token and the second is decode which generates …