MiniCache: KV cache compression in depth dimension for large language models
A critical approach for efficiently deploying computationally demanding large language
models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of …
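A minimal sketch of the mechanism this entry builds on, assuming toy dimensions and stand-in projections: a per-layer KV cache that grows by one row per decode step, plus a naive adjacent-layer merge to gesture at MiniCache's depth-dimension compression (the paper's actual merge is a more careful interpolation with outlier-token retention, not a plain average).

```python
import numpy as np

D = 64  # model width; toy size for illustration

def attend(q, K, V):
    # Scaled dot-product attention of one query over all cached positions.
    s = q @ K.T / np.sqrt(D)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

def decode_step(x, kv_cache):
    # One (K, V) buffer per layer: each decode step appends a single row
    # instead of recomputing keys/values for the whole prefix.
    h = x
    for cache in kv_cache:
        k, v = h, h  # stand-ins for the real K/V projections
        cache["K"] = np.vstack([cache["K"], k])
        cache["V"] = np.vstack([cache["V"], v])
        h = attend(h, cache["K"], cache["V"])
    return h

def merge_adjacent(kv_cache):
    # Depth-dimension compression in the MiniCache spirit: adjacent layers'
    # KV states are similar enough that a pair can share one merged buffer,
    # roughly halving cache memory. The average here is our placeholder rule.
    return [{"K": (a["K"] + b["K"]) / 2, "V": (a["V"] + b["V"]) / 2}
            for a, b in zip(kv_cache[::2], kv_cache[1::2])]

kv = [{"K": np.empty((0, D)), "V": np.empty((0, D))} for _ in range(4)]
for _ in range(3):
    decode_step(np.random.randn(D), kv)
small = merge_adjacent(kv)  # 4 layer buffers -> 2 shared buffers
```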
The unreasonable ineffectiveness of the deeper layers
We empirically study a simple layer-pruning strategy for popular families of open-weight
pretrained LLMs, finding minimal degradation of performance on different question …
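A sketch of the layer-pruning recipe the abstract studies, under assumptions: layers are plain callables, and the block to drop is chosen by the angular distance between representations before and after it, matching the paper's selection heuristic; the subsequent "healing" finetune step is omitted.

```python
import numpy as np

def angular_distance(h_in, h_out):
    # Blocks whose outputs barely rotate the representation are candidates
    # for removal.
    cos = h_in @ h_out / (np.linalg.norm(h_in) * np.linalg.norm(h_out))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def best_block_to_drop(layers, x, n_drop):
    # Run once, record hidden states, and pick the contiguous block whose
    # removal perturbs the representation least.
    states = [x]
    for layer in layers:
        states.append(layer(states[-1]))
    dists = [angular_distance(states[i], states[i + n_drop])
             for i in range(len(layers) - n_drop + 1)]
    return int(np.argmin(dists))

def prune(layers, start, n_drop):
    # Drop layers[start : start + n_drop]; per the paper's empirical study,
    # the cheapest blocks to drop tend to sit in the deeper half.
    return layers[:start] + layers[start + n_drop:]

layers = [lambda h, s=s: h + 0.01 * s * h for s in range(8)]  # toy "layers"
x = np.random.randn(16)
start = best_block_to_drop(layers, x, n_drop=2)
pruned = prune(layers, start, n_drop=2)
```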
Large language model inference acceleration: A comprehensive hardware perspective
Large Language Models (LLMs) have demonstrated remarkable capabilities across various
fields, from natural language understanding to text generation. Compared to non-generative …
EAGLE-2: Faster inference of language models with dynamic draft trees
Inference with modern Large Language Models (LLMs) is expensive and time-consuming,
and speculative sampling has proven to be an effective solution. Most speculative sampling …
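For context, a sketch of plain greedy speculative decoding, the loop EAGLE-2 accelerates; the paper's contribution, a context-dependent dynamic draft tree, is not shown. `draft_next` and `target_next` are hypothetical next-token functions for a small draft model and the large target model.

```python
def speculative_decode(prefix, draft_next, target_next, k=4, new_tokens=32):
    out = list(prefix)
    goal = len(prefix) + new_tokens
    while len(out) < goal:
        # 1) The cheap draft model guesses k tokens ahead.
        ctx, guesses = list(out), []
        for _ in range(k):
            t = draft_next(ctx)
            guesses.append(t)
            ctx.append(t)
        # 2) The target model scores the guessed positions (in practice one
        #    batched forward pass); matching prefixes are accepted for free,
        #    and its own token is emitted at the first mismatch.
        for g in guesses:
            t = target_next(out)
            out.append(t)
            if t != g or len(out) >= goal:
                break  # discard the remaining draft tokens
    return out[:goal]
```

When the draft model is well aligned with the target, several tokens are accepted per expensive target pass, which is where the speedup comes from.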
Knowledge circuits in pretrained transformers
The remarkable capabilities of modern large language models are rooted in their vast
repositories of knowledge encoded within their parameters, enabling them to perceive the …
Multi-layer transformers gradient can be approximated in almost linear time
The computational complexity of the self-attention mechanism in popular transformer
architectures poses significant challenges for training and inference, and becomes the …
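To make the bottleneck concrete, a toy view of what the paper's almost-linear approximation sidesteps: naive attention materializes an n-by-n score matrix, so the forward pass and the gradients through it both cost on the order of n^2 time and memory.

```python
import numpy as np

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
S = Q @ K.T / np.sqrt(d)                      # (n, n) scores: the bottleneck
P = np.exp(S - S.max(axis=1, keepdims=True))  # row-wise softmax
P /= P.sum(axis=1, keepdims=True)
out = P @ V                                   # backprop also touches all n^2 entries
```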
A tighter complexity analysis of SparseGPT
In this work, we improved the analysis of the running time of SparseGPT [Frantar, Alistarh
ICML 2023] from $O(d^{3})$ to $O(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a})$ …
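A gloss on reading the bound, in standard fast matrix multiplication notation (our annotation, not the paper's text):

```latex
% \omega < 2.372 is the square matrix-multiplication exponent, and
% \omega(1,1,a) the rectangular one: multiplying an n x n matrix by an
% n x n^a matrix takes n^{\omega(1,1,a)+o(1)} time.
\[
  O\!\left(d^{\omega} + d^{2+a+o(1)} + d^{1+\omega(1,1,a)-a}\right),
  \qquad a \in [0,1].
\]
% Choosing a to balance the last two terms yields a running time strictly
% below the previous O(d^{3}).
```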
VideoLLM-MoD: Efficient video-language streaming with mixture-of-depths vision computation
A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while
increasing the number of vision tokens generally enhances visual understanding, it also …
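A sketch of the mixture-of-depths routing idea the title refers to, with illustrative names rather than VideoLLM-MoD's actual modules: a learned router keeps only the top-k tokens for the full block computation and lets the rest skip through the residual path, so most vision tokens cost almost nothing at that layer.

```python
import numpy as np

def mod_layer(x, block, router_w, keep_ratio=0.25):
    # x: (tokens, dim). The router scores each token; only the top-k tokens
    # pay for the full block, everyone else rides the residual path.
    scores = x @ router_w
    k = max(1, int(len(x) * keep_ratio))
    keep = np.argsort(scores)[-k:]
    out = x.copy()                          # skipped tokens pass unchanged
    out[keep] = x[keep] + block(x[keep])    # only k tokens pay compute
    return out

dim = 32
x = np.random.randn(256, dim)               # e.g. 256 vision tokens
w = np.random.randn(dim)
y = mod_layer(x, block=lambda h: 0.1 * h, router_w=w)
```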
LazyLLM: Dynamic token pruning for efficient long context LLM inference
The inference of transformer-based large language models consists of two sequential
stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token …
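The two stages named in the abstract, sketched with a LazyLLM-flavored twist: prefill only the prompt tokens scored important and defer the rest. The `model.*` methods and the `importance` scorer are hypothetical stand-ins, not the paper's API; the actual method scores tokens with attention from earlier layers and can revive deferred tokens in later steps.

```python
def generate(model, prompt_ids, max_new=16, keep_ratio=0.5):
    scores = model.importance(prompt_ids)          # placeholder token scoring
    k = max(1, int(len(prompt_ids) * keep_ratio))
    kept = sorted(sorted(range(len(prompt_ids)),   # top-k by score,
                         key=lambda i: scores[i])[-k:])  # restored to order

    # Stage 1: prefill computes the KV cache of the (pruned) prompt and the
    # first output token in one pass.
    kv, tok = model.prefill([prompt_ids[i] for i in kept])

    # Stage 2: decode extends the sequence one token at a time, appending to
    # the cache rather than recomputing it.
    out = [tok]
    for _ in range(max_new - 1):
        tok, kv = model.decode_step(tok, kv)
        out.append(tok)
    return out
```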
Challenges in deploying long-context transformers: A theoretical peak performance analysis
Y Fu - arXiv preprint arXiv:2405.08944, 2024 - arxiv.org
Transformer-based long context generative models power emerging AI applications like
hour-long video understanding and project-level coding agents. Deploying long context …
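Back-of-envelope KV-cache sizing, the kind of arithmetic such a peak-performance analysis starts from; the model shape below is a hypothetical example (grouped-query attention, fp16), not a configuration taken from the paper.

```python
layers, kv_heads, head_dim = 32, 8, 128
seq_len, batch, bytes_per_elem = 131_072, 1, 2   # 128K context, fp16

# 2x for keys and values; each layer stores kv_heads * head_dim per token.
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB per sequence")  # -> 16.0 GiB
```

At this scale the cache, not the weights, can dominate memory and bandwidth, which is the kind of deployment bottleneck such an analysis quantifies.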