MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention
The computational challenges of Large Language Model (LLM) inference remain a
significant barrier to their widespread deployment, especially as prompt lengths continue to …
A comprehensive survey of small language models in the era of large language models: Techniques, enhancements, applications, collaboration with LLMs, and …
Large language models (LLMs) have demonstrated emergent abilities in text generation,
question answering, and reasoning, facilitating various tasks and domains. Despite their …
LOOK-M: Look-once optimization in KV cache for efficient multimodal long-context inference
Long-context Multimodal Large Language Models (MLLMs) demand substantial
computational resources for inference as the growth of their multimodal Key-Value (KV) …
A survey of Mamba
As one of the most representative DL techniques, the Transformer architecture has empowered
numerous advanced models, especially the large language models (LLMs) that comprise …
Get more with less: Synthesizing recurrence with KV cache compression for efficient LLM inference
Many computational factors limit broader deployment of large language models. In this
paper, we focus on a memory bottleneck imposed by the key-value (KV) cache, a …
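The KV-cache memory bottleneck this abstract (and several entries below) refers to grows linearly with sequence length. A minimal back-of-the-envelope sketch, assuming an illustrative 7B-class configuration (32 layers, 32 KV heads, head dimension 128, fp16 storage; these numbers are assumptions for illustration, not taken from the paper):

```python
# KV cache footprint: 2 tensors (K and V) per layer, each of shape
# [seq_len, n_kv_heads, head_dim], stored at bytes_per_elem bytes each.
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

# Illustrative (assumed) 7B-class model: 32 layers, 32 KV heads, head_dim 128.
for seq_len in (4_096, 32_768, 128_000):
    gib = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:6.1f} GiB of KV cache per sequence")
```

Under these assumptions the cache costs 0.5 MiB per token, so a 128k-token prompt needs roughly 62 GiB per sequence, which is why compression and eviction methods target the cache rather than the model weights.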
KV cache compression, but what must we give in return? A comprehensive benchmark of long context capable approaches
Long context capability is a crucial competency for large language models (LLMs) as it
mitigates the human struggle to digest long-form texts. This capability enables complex task …
Human-like episodic memory for infinite context LLMs
Large language models (LLMs) have shown remarkable capabilities, but still struggle with
processing extensive contexts, limiting their ability to maintain coherence and accuracy over …
LazyLLM: Dynamic token pruning for efficient long context LLM inference
The inference of transformer-based large language models consists of two sequential
stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token …
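The two-stage pipeline this abstract names (prefill, then autoregressive decode) can be made concrete with a toy loop. `model_forward` below is a hypothetical stub standing in for a real transformer forward pass, not any library's API:

```python
# Toy sketch of the two sequential inference stages: (1) prefill runs the
# whole prompt once, materializing the KV cache and the first token;
# (2) decode feeds one token at a time, reusing and extending that cache.
from typing import List, Tuple

Cache = List[Tuple[int, ...]]  # placeholder for per-layer K/V entries

def model_forward(tokens: List[int], cache: Cache) -> Tuple[int, Cache]:
    """Hypothetical stub: returns a 'next token' and an extended cache."""
    cache = cache + [tuple(tokens)]       # pretend to append K/V for `tokens`
    return sum(tokens) % 50_000, cache    # pretend logits -> argmax token

def generate(prompt: List[int], max_new_tokens: int) -> List[int]:
    # Stage 1: prefill -- cost scales with the full prompt length; this is
    # the stage dynamic token pruning (as in LazyLLM) aims to shorten.
    next_tok, cache = model_forward(prompt, cache=[])
    out = [next_tok]
    # Stage 2: decode -- one token per step; the cache grows incrementally.
    for _ in range(max_new_tokens - 1):
        next_tok, cache = model_forward([out[-1]], cache)
        out.append(next_tok)
    return out

print(generate([101, 7592, 2088], max_new_tokens=4))
```

Because prefill touches every prompt token while decode touches one token per step, long prompts make the first stage dominate time-to-first-token, which is the regime token pruning targets.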
D2O: Dynamic discriminative operations for efficient generative inference of large language models
Efficient inference in Large Language Models (LLMs) is impeded by the growing memory
demands of key-value (KV) caching, especially for longer sequences. Traditional KV cache …
A deeper look at depth pruning of LLMs
Large Language Models (LLMs) are not only resource-intensive to train but even more
costly to deploy in production. Therefore, recent work has attempted to prune blocks of LLMs …