RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Transformer-based Large Language Models (LLMs) have become increasingly important.
However, due to the quadratic time complexity of attention computation, scaling LLMs to …
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
Online LLM inference powers many exciting applications such as intelligent chatbots and
autonomous agents. Modern LLM inference engines widely rely on request batching to …
ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference
Sparse Mixture of Experts (MoE) models, while outperforming dense Large Language
Models (LLMs) in terms of performance, face significant deployment challenges during …
DeepFlow: Serverless Large Language Model Serving at Scale
J Hu, J Xu, Z Liu, Y He, Y Chen, H Xu, J Liu… - arXiv preprint arXiv…, 2025 - arxiv.org
This paper introduces DeepFlow, a scalable and serverless AI platform designed to
efficiently serve large language models (LLMs) at scale in cloud environments. DeepFlow …
iServe: An Intent-based Serving System for LLMs
Large Language Models (LLMs) are becoming ubiquitous across industries, where
applications demand they fulfill diverse user intents. However, developers currently face the …
KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management
The stateful nature of large language model (LLM) serving can easily throttle precious GPU
memory under load bursts or long-generation requests like chain-of-thought reasoning …
[CITATION][C] Empowering Large Language Models to Edge Intelligence: A Survey of Edge Efficient LLMs and Techniques
R Wang, Z Gao, L Zhang, S Yue, Z Gao