A survey of techniques for optimizing transformer inference
Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …
SqueezeLLM: Dense-and-sparse quantization
Generative Large Language Models (LLMs) have demonstrated remarkable results for a
wide range of tasks. However, deploying these models for inference has been a significant …
Speculative decoding with big little decoder
The recent emergence of Large Language Models based on the Transformer architecture
has enabled dramatic advancements in the field of Natural Language Processing. However …
Towards efficient generative large language model serving: A survey from algorithms to systems
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …
The What, Why, and How of Context Length Extension Techniques in Large Language Models--A Detailed Survey
The advent of Large Language Models (LLMs) represents a notable breakthrough in Natural
Language Processing (NLP), contributing to substantial progress in both text …
ReLU strikes back: Exploiting activation sparsity in large language models
Large Language Models (LLMs) with billions of parameters have drastically transformed AI
applications. However, their demanding computation during inference has raised significant …
A survey of resource-efficient LLM and multimodal foundation models
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine …
Response length perception and sequence scheduling: An LLM-empowered LLM inference pipeline
Large language models (LLMs) have revolutionized the field of AI, demonstrating
unprecedented capacity across various tasks. However, the inference process for LLMs …
Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs
Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs)
and preserve the model quality consistently across varied applications. However, existing …
LLMCad: Fast and scalable on-device large language model inference
Generative tasks, such as text generation and question answering, hold a crucial position in
the realm of mobile applications. Due to their sensitivity to privacy concerns, there is a …