A survey of techniques for optimizing transformer inference
Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …
Weight-sharing neural architecture search: A battle to shrink the optimization gap
Neural architecture search (NAS) has attracted increasing attention. In recent years,
individual search methods have been replaced by weight-sharing search methods for higher …
SqueezeLLM: Dense-and-sparse quantization
Generative Large Language Models (LLMs) have demonstrated remarkable results for a
wide range of tasks. However, deploying these models for inference has been a significant …
A fast post-training pruning framework for transformers
Pruning is an effective way to reduce the huge inference cost of Transformer models.
However, prior work on pruning Transformers requires retraining the models. This can add …
Speculative decoding with big little decoder
The recent emergence of Large Language Models based on the Transformer architecture
has enabled dramatic advancements in the field of Natural Language Processing. However …
Enable deep learning on mobile devices: Methods, systems, and applications
Deep neural networks (DNNs) have achieved unprecedented success in the field of artificial
intelligence (AI), including computer vision, natural language processing, and speech …
Funnel-Transformer: Filtering out sequential redundancy for efficient language processing
With the success of language pretraining, it is highly desirable to develop more efficient
architectures of good scalability that can exploit the abundant unlabeled data at a lower cost …
Compressing large-scale transformer-based models: A case study on BERT
Pre-trained Transformer-based models have achieved state-of-the-art performance for
various Natural Language Processing (NLP) tasks. However, these models often have …
Neural architecture search for transformers: A survey
Transformer-based Deep Neural Network architectures have gained tremendous interest
due to their effectiveness in various applications across Natural Language Processing (NLP) …
FashionBERT: Text and image matching with adaptive loss for cross-modal retrieval
In this paper, we address text and image matching in cross-modal retrieval for the fashion
industry. Unlike matching in the general domain, fashion matching is required …