Towards all-in-one pre-training via maximizing multi-modal mutual information

W Su, X Zhu, C Tao, L Lu, B Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
To effectively exploit the potential of large-scale models, various pre-training strategies
supported by massive data from different sources are proposed, including supervised pre …

Multitask vision-language prompt tuning

S Shen, S Yang, T Zhang, B Zhai… - Proceedings of the …, 2024 - openaccess.thecvf.com
Prompt Tuning, conditioning on task-specific learned prompt vectors, has emerged as a data-
efficient and parameter-efficient method for adapting large pretrained vision-language …

Masked autoencoders are efficient class incremental learners

JT Zhai, X Liu, AD Bagdanov, K Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Abstract Class Incremental Learning (CIL) aims to sequentially learn new classes while
avoiding catastrophic forgetting of previous knowledge. We propose to use Masked …

Masked autoencoders in computer vision: A comprehensive survey

Z Zhou, X Liu - IEEE Access, 2023 - ieeexplore.ieee.org
The masked autoencoder (MAE) is a deep learning method based on the Transformer. Originally
used for images, it has since been extended to video, audio, and some other temporal …

Masked Autoencoders are Secretly Efficient Learners

Z Wei, C Wei, J Mei, Y Bai, Z Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
This paper provides an efficiency study of training Masked Autoencoders (MAE), a framework
introduced by He et al. for pre-training Vision Transformers (ViTs). Our results surprisingly …

Masked autoencoding does not help natural language supervision at scale

F Weers, V Shankar, A Katharopoulos… - Proceedings of the …, 2023 - openaccess.thecvf.com
Self-supervision and natural language supervision have emerged as two exciting ways to
train general-purpose image encoders which excel at a variety of downstream tasks. Recent …

Masked Audio Modeling with CLAP and Multi-Objective Learning

Y **n, X Peng, Y Lu - arxiv preprint arxiv:2401.15953, 2024 - arxiv.org
Most existing masked audio modeling (MAM) methods learn audio representations by
masking and reconstructing local spectrogram patches. However, the reconstruction loss …

Masked Image Modeling: A Survey

V Hondru, FA Croitoru, S Minaee, RT Ionescu… - arxiv preprint arxiv …, 2024 - arxiv.org
In this work, we survey recent studies on masked image modeling (MIM), an approach that
emerged as a powerful self-supervised learning technique in computer vision. The MIM task …

Self-supervised approach for diabetic retinopathy severity detection using vision transformer

K Ohri, M Kumar, D Sukheja - Progress in Artificial Intelligence, 2024 - Springer
Diabetic retinopathy (DR) is a diabetic condition that affects vision. Despite the great success
of supervised learning and Convolutional Neural Networks (CNNs), it is still challenging to …

Aerial image object detection with vision transformer detector (ViTDet)

L Wang, A Tien - IGARSS 2023-2023 IEEE International …, 2023 - ieeexplore.ieee.org
The past few years have seen an increased interest in aerial image object detection due to
its critical value to large-scale geoscientific research like environmental studies, urban …