Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

A primer on contrastive pretraining in language processing: Methods, lessons learned, and perspectives

N Rethmeier, I Augenstein - ACM Computing Surveys, 2023 - dl.acm.org
Modern natural language processing (NLP) methods employ self-supervised pretraining
objectives such as masked language modeling to boost the performance of various …
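The entry above refers to masked-language-modeling pretraining. As a hedged illustration only, the minimal PyTorch sketch below shows an MLM-style objective with a toy encoder and arbitrary sizes; none of the names or numbers come from the cited survey.

```python
import torch
import torch.nn as nn

# Toy masked-language-modeling objective (illustrative only, not the cited paper's code).
vocab_size, hidden, mask_id = 1000, 64, 3

encoder = nn.Sequential(
    nn.Embedding(vocab_size, hidden),
    nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
)
lm_head = nn.Linear(hidden, vocab_size)

tokens = torch.randint(4, vocab_size, (8, 16))           # batch of token ids
mask = torch.rand(tokens.shape) < 0.15                    # mask roughly 15% of positions
inputs = tokens.masked_fill(mask, mask_id)                # replace masked positions with a [MASK] id

logits = lm_head(encoder(inputs))                         # (8, 16, vocab_size)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # predict only the masked tokens
loss.backward()
print(float(loss))
```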

LLM-Pruner: On the structural pruning of large language models

X Ma, G Fang, X Wang - Advances in neural information …, 2023 - proceedings.neurips.cc
Large language models (LLMs) have shown remarkable capabilities in language
understanding and generation. However, such impressive capability typically comes with a …
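For context on what "structural" pruning refers to here, the sketch below removes whole output neurons of a linear layer by weight-norm magnitude. This is a generic illustration under assumed sizes, not LLM-Pruner's gradient-based, dependency-aware criterion.

```python
import torch
import torch.nn as nn

def prune_linear_rows(layer: nn.Linear, keep_ratio: float = 0.5) -> nn.Linear:
    """Structured pruning sketch: keep the output neurons with the largest L2 weight norm."""
    importance = layer.weight.norm(dim=1)                  # one score per output neuron (row)
    n_keep = max(1, int(keep_ratio * layer.out_features))
    keep = importance.topk(n_keep).indices.sort().values   # indices of surviving rows

    pruned = nn.Linear(layer.in_features, n_keep, bias=layer.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(layer.weight[keep])
        if layer.bias is not None:
            pruned.bias.copy_(layer.bias[keep])
    return pruned

layer = nn.Linear(512, 2048)
print(prune_linear_rows(layer, keep_ratio=0.25))            # Linear(512 -> 512)
```

In a full model the next layer's input columns would have to be pruned to match, and a short recovery fine-tuning stage would typically follow.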

LoSparse: Structured compression of large language models based on low-rank and sparse approximation

Y Li, Y Yu, Q Zhang, C Liang, P He… - International …, 2023 - proceedings.mlr.press
Transformer models have achieved remarkable results in various natural language tasks,
but they are often prohibitively large, requiring massive memories and computational …
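To illustrate the low-rank-plus-sparse idea named in the title, a plain SVD-based decomposition W ≈ UV + S can be sketched as follows. LoSparse itself learns the factors during training with importance-based pruning of the sparse component, which is not reproduced here; rank and sparsity values are arbitrary.

```python
import torch

def lowrank_plus_sparse(W: torch.Tensor, rank: int = 8, sparsity: float = 0.01):
    """Approximate W as U @ V (low-rank) plus a sparse residual S (illustrative only)."""
    U_full, sing, Vt = torch.linalg.svd(W, full_matrices=False)
    U = U_full[:, :rank] * sing[:rank]            # absorb singular values into U
    V = Vt[:rank]
    residual = W - U @ V

    k = max(1, int(sparsity * W.numel()))         # keep only the k largest residual entries
    thresh = residual.abs().flatten().topk(k).values.min()
    S = torch.where(residual.abs() >= thresh, residual, torch.zeros_like(residual))
    return U, V, S

W = torch.randn(256, 256)
U, V, S = lowrank_plus_sparse(W, rank=16, sparsity=0.02)
print("relative reconstruction error:", (torch.norm(W - (U @ V + S)) / torch.norm(W)).item())
```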

Compression of generative pre-trained language models via quantization

C Tao, L Hou, W Zhang, L Shang, X Jiang, Q Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
The increasing size of generative Pre-trained Language Models (PLMs) has greatly
increased the demand for model compression. Despite various methods to compress BERT …
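As a rough illustration of weight quantization in general (not the quantization-aware training studied in the cited paper), a symmetric per-tensor int8 round-to-nearest scheme looks like this:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 round-to-nearest quantization (illustrative only)."""
    scale = w.abs().max() / 127.0                         # map the largest weight magnitude to 127
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(768, 768)
q, scale = quantize_int8(w)
print("mean abs quantization error:", (w - dequantize(q, scale)).abs().mean().item())
```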

Compressing large-scale transformer-based models: A case study on BERT

P Ganesh, Y Chen, X Lou, MA Khan, Y Yang… - Transactions of the …, 2021 - direct.mit.edu
Pre-trained Transformer-based models have achieved state-of-the-art performance for
various Natural Language Processing (NLP) tasks. However, these models often have …

Less is more: Task-aware layer-wise distillation for language model compression

C Liang, S Zuo, Q Zhang, P He… - … on Machine Learning, 2023 - proceedings.mlr.press
Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into
small ones (i.e., student models). The student distills knowledge from the teacher by …
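A generic layer-wise distillation loss, matching intermediate hidden states plus softened output logits, can be sketched as below. The task-aware filtering that the cited paper adds is omitted, and all dimensions and weights are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Generic layer-wise distillation loss (illustrative; the cited paper adds task-aware filtering).
d_teacher, d_student, vocab = 768, 384, 1000
proj = nn.Linear(d_student, d_teacher)            # map student hidden size to the teacher's

def distill_loss(t_hidden, s_hidden, t_logits, s_logits, T=2.0, alpha=0.5):
    # Match intermediate representations layer by layer (paired lists of hidden states).
    layer_loss = sum(F.mse_loss(proj(s), t) for s, t in zip(s_hidden, t_hidden))
    # Match softened output distributions (standard KD term).
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    return alpha * layer_loss + (1 - alpha) * kd

t_hidden = [torch.randn(8, 16, d_teacher) for _ in range(4)]
s_hidden = [torch.randn(8, 16, d_student, requires_grad=True) for _ in range(4)]
loss = distill_loss(t_hidden, s_hidden,
                    torch.randn(8, vocab),
                    torch.randn(8, vocab, requires_grad=True))
loss.backward()
print(float(loss))
```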

Wasserstein contrastive representation distillation

L Chen, D Wang, Z Gan, J Liu… - Proceedings of the …, 2021 - openaccess.thecvf.com
The primary goal of knowledge distillation (KD) is to encapsulate the information of a model
learned from a teacher network into a student network, with the latter being more compact …

Not all negatives are equal: Label-aware contrastive loss for fine-grained text classification

V Suresh, DC Ong - arXiv preprint arXiv:2109.05427, 2021 - arxiv.org
Fine-grained classification involves datasets with a larger number of classes and
subtle differences between them. Guiding the model to focus on differentiating dimensions …
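For reference, a generic supervised (label-aware) contrastive loss, in which embeddings sharing a label are treated as positives, can be sketched as follows; the cited paper's specific weighting of negatives by label information is not reproduced, and the temperature and sizes are assumptions.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.1):
    """Generic supervised contrastive loss: same-label embeddings are pulled together (sketch only)."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau                                  # pairwise cosine similarities
    mask_self = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(mask_self, -1e9)                 # exclude self-comparisons

    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)    # log softmax over all other samples
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~mask_self
    # Average log-probability of each anchor's positives; skip anchors with no positive.
    per_anchor = (log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return -per_anchor[pos.any(1)].mean()

z = torch.randn(16, 128, requires_grad=True)
labels = torch.randint(0, 4, (16,))
loss = supervised_contrastive_loss(z, labels)
loss.backward()
print(float(loss))
```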

Compressing visual-linguistic model via knowledge distillation

Z Fang, J Wang, X Hu, L Wang… - Proceedings of the …, 2021 - openaccess.thecvf.com
Despite exciting progress in pre-training for visual-linguistic (VL) representations, very few
aspire to a small VL model. In this paper, we study knowledge distillation (KD) to effectively …