On-device language models: A comprehensive review

J Xu, Z Li, W Chen, Q Wang, X Gao, Q Cai… - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of large language models (LLMs) has revolutionized natural language processing
applications, and running LLMs on edge devices has become increasingly attractive for …

Offline energy-optimal LLM serving: Workload-based energy models for LLM inference on heterogeneous systems

G Wilkins, S Keshav, R Mortier - arXiv preprint arXiv:2407.04014, 2024 - arxiv.org
The rapid adoption of large language models (LLMs) has led to significant advances in
natural language processing and text generation. However, the energy consumed through …

Queue Management for SLO-Oriented Large Language Model Serving

A Patke, D Reddy, S Jha, H Qiu, C Pinto… - Proceedings of the …, 2024 - dl.acm.org
Large language model (LLM) serving is becoming an increasingly critical workload for cloud
providers. Existing LLM serving systems focus on interactive requests, such as chatbots and …

Deferred continuous batching in resource-efficient large language model serving

Y He, Y Lu, G Alonso - Proceedings of the 4th Workshop on Machine …, 2024 - dl.acm.org
Although prior work on batched inference and parameter-efficient fine-tuning techniques
has reduced the resource requirements of large language models (LLMs), challenges …

Watermarking Large Language Models and the Generated Content: Opportunities and Challenges

R Zhang, F Koushanfar - arXiv preprint arXiv:2410.19096, 2024 - arxiv.org
The widely adopted and powerful generative large language models (LLMs) have raised
concerns about intellectual property rights violations and the spread of machine-generated …

On the Cost of Model-Serving Frameworks: An Experimental Evaluation

P De Rosa, YD Bromberg, P Felber… - 2024 IEEE …, 2024 - ieeexplore.ieee.org
In machine learning (ML), the inference phase is the process of applying pre-trained models
to new, unseen data with the objective of making predictions. During the inference phase …

Efficient LLM Scheduling by Learning to Rank

Y Fu, S Zhu, R Su, A Qiao, I Stoica, H Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
In Large Language Model (LLM) inference, the output length of an LLM request is typically
not known a priori. Consequently, most LLM serving systems employ a simple …

Software Performance Engineering for Foundation Model-Powered Software (FMware)

H Zhang, S Chang, A Leung, K Thangarajah… - arXiv preprint arXiv …, 2024 - arxiv.org
The rise of Foundation Models (FMs) like Large Language Models (LLMs) is revolutionizing
software development. Despite the impressive prototypes, transforming FMware into …

IMI: In-memory Multi-job Inference Acceleration for Large Language Models

B Gao, Z Wang, Z He, T Luo, WF Wong… - Proceedings of the 53rd …, 2024 - dl.acm.org
Large Language Models (LLMs) are increasingly used in various applications but are
computationally complex and energy-consuming due to the high volume of off-chip memory …

Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads

G Wilkins, S Keshav, R Mortier - Proceedings of the 15th ACM …, 2024 - dl.acm.org
Both the training and use of Large Language Models (LLMs) require large amounts of
energy. Their increasing popularity, therefore, raises critical concerns regarding the energy …