A comprehensive survey of compression algorithms for language models
How can we compress language models without sacrificing accuracy? The number of
compression algorithms for language models is rapidly growing to benefit from remarkable …
Understanding the potential of FPGA-based spatial acceleration for large language model inference
Recent advancements in large language models (LLMs) boasting billions of parameters
have generated a significant demand for efficient deployment in inference workloads. While …
SPEED: Speculative pipelined execution for efficient decoding
Generative Large Language Models (LLMs) based on the Transformer architecture have
recently emerged as a dominant foundation model for a wide range of Natural Language …
ChunkAttention: Efficient self-attention with prefix-aware KV cache and two-phase partition
L Ye, Z Tao, Y Huang, Y Li - arXiv preprint arXiv:2402.15220, 2024 - arxiv.org
Self-attention is an essential component of large language models (LLMs) but a significant
source of inference latency for long sequences. In multi-tenant LLM serving scenarios, the …
MagR: Weight magnitude reduction for enhancing post-training quantization
In this paper, we present a simple optimization-based preprocessing technique called
Weight Magnitude Reduction (MagR) to improve the performance of post-training …
SOFA: A compute-memory optimized sparsity accelerator via cross-stage coordinated tiling
H Wang, J Fang, X Tang, Z Yue, J Li… - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Benefiting from the self-attention mechanism, Transformer models have attained impressive
contextual comprehension capabilities for lengthy texts. The requirements of high …
ZeD: A generalized accelerator for variably sparse matrix computations in ML
Modern Machine Learning (ML) models employ sparsity to mitigate storage and computation
costs, but this gives rise to irregular and unstructured sparse matrix operations that dominate …
Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models
In the real world, large language models (LLMs) can serve as assistants that help users
accomplish their jobs, and can also support the development of advanced applications. For the …
Conformer-based speech recognition on extreme edge-computing devices
With increasingly more powerful compute capabilities and resources in today's devices,
traditionally compute-intensive automatic speech recognition (ASR) has been moving from …
Foundations of Large Language Models
This is a book about large language models. As indicated by the title, it primarily focuses on
foundational concepts rather than comprehensive coverage of all cutting-edge technologies …