Advances in medical image analysis with vision transformers: a comprehensive review

R Azad, A Kazerouni, M Heidari, EK Aghdam… - Medical Image …, 2024 - Elsevier
The remarkable performance of the Transformer architecture in natural language processing
has recently also triggered broad interest in Computer Vision. Among other merits …

DINOv2: Learning robust visual features without supervision

M Oquab, T Darcet, T Moutakanni, H Vo… - arXiv preprint arXiv …, 2023 - arxiv.org
The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …

Self-supervised learning from images with a joint-embedding predictive architecture

M Assran, Q Duval, I Misra… - Proceedings of the …, 2023 - openaccess.thecvf.com
This paper demonstrates an approach for learning highly semantic image representations
without relying on hand-crafted data-augmentations. We introduce the Image-based Joint …
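The idea the snippet gestures at is compact enough to sketch: a context encoder sees only the unmasked patches, a predictor regresses the representations that an EMA target encoder assigns to the masked patches, and the loss lives in latent space rather than pixel space. The following PyTorch toy uses linear layers as stand-ins for the paper's ViT encoders; all names, sizes, and the masking ratio are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Linear stand-ins for the context encoder, predictor, and EMA target encoder
# (ViTs in the real architecture); D and the mask ratio below are made up.
D = 64
context_encoder = nn.Linear(D, D)
predictor = nn.Linear(D, D)
target_encoder = nn.Linear(D, D)
target_encoder.load_state_dict(context_encoder.state_dict())  # EMA copy at init

def jepa_loss(patches, mask):
    # patches: (B, N, D) patch embeddings; mask: (B, N) bool, True = masked.
    with torch.no_grad():                         # targets come from the EMA encoder
        targets = target_encoder(patches)         # representations of the full view
    visible = patches * (~mask).unsqueeze(-1).float()
    pred = predictor(context_encoder(visible))    # predict masked representations
    return F.mse_loss(pred[mask], targets[mask])  # loss in latent, not pixel, space

x = torch.randn(2, 16, D)
m = torch.rand(2, 16) < 0.75
print(jepa_loss(x, m))

Because the targets are representations rather than pixels, the pretext task can be semantic without hand-crafted augmentations, which is the contrast the abstract draws.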

Semantic image segmentation: Two decades of research

G Csurka, R Volpi, B Chidlovskii - Foundations and Trends® …, 2022 - nowpublishers.com
Semantic image segmentation (SiS) plays a fundamental role in a broad variety of computer
vision applications, providing key information for the global understanding of an image. This …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on …

DeiT III: Revenge of the ViT

H Touvron, M Cord, H Jégou - European conference on computer vision, 2022 - Springer
A Vision Transformer (ViT) is a simple neural architecture amenable to serve
several computer vision tasks. It has limited built-in architectural priors, in contrast to more …
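The "limited built-in architectural priors" are easy to see in code: an image becomes a sequence of patch tokens, a learned position embedding is the only spatial prior, and the rest is a plain Transformer encoder. A deliberately tiny PyTorch sketch follows; the sizes are illustrative, not DeiT III's training configuration.

import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, dim=192, depth=4, heads=3, classes=1000):
        super().__init__()
        n = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))  # the only spatial prior
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):
        t = self.embed(x).flatten(2).transpose(1, 2)           # (B, N, dim) tokens
        t = torch.cat([self.cls.expand(len(x), -1, -1), t], 1) + self.pos
        return self.head(self.encoder(t)[:, 0])               # classify via [CLS]

print(TinyViT()(torch.randn(1, 3, 224, 224)).shape)           # torch.Size([1, 1000])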

Masked siamese networks for label-efficient learning

M Assran, M Caron, I Misra, P Bojanowski… - … on Computer Vision, 2022 - Springer
We propose Masked Siamese Networks (MSN), a self-supervised learning
framework for learning image representations. Our approach matches the representation of …
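The matching happens through a bank of prototypes: the embedding of a randomly masked (anchor) view is softly assigned to prototypes, and a cross-entropy pulls that assignment toward the sharper assignment of the unmasked (target) view. A minimal sketch, with the prototype count and temperatures invented for illustration; the paper's mean-entropy regularizer and EMA target encoder are omitted.

import torch
import torch.nn.functional as F

# Learnable during training; fixed random here purely for illustration.
prototypes = F.normalize(torch.randn(1024, 64), dim=-1)  # (K, D)

def msn_loss(z_anchor, z_target, t_anchor=0.1, t_target=0.025):
    # z_anchor: (B, D) embedding of the masked view; z_target: (B, D) unmasked view.
    p_anchor = F.softmax(F.normalize(z_anchor, dim=-1) @ prototypes.T / t_anchor, dim=-1)
    with torch.no_grad():  # the target assignment is sharpened and carries no gradient
        p_target = F.softmax(F.normalize(z_target, dim=-1) @ prototypes.T / t_target, dim=-1)
    return -(p_target * torch.log(p_anchor + 1e-8)).sum(-1).mean()

print(msn_loss(torch.randn(8, 64), torch.randn(8, 64)))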

SLIP: Self-supervision meets language-image pre-training

N Mu, A Kirillov, D Wagner, S Xie - European conference on computer …, 2022 - Springer
Recent work has shown that self-supervised pre-training leads to improvements over
supervised learning on challenging visual recognition tasks. CLIP, an exciting new …
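At the loss level, SLIP's combination can be read as the sum of two InfoNCE objectives: CLIP's symmetric image-text contrastive loss plus a SimCLR-style contrastive loss between two augmented views of each image. A sketch of that sum; the temperatures and the weight between the terms are assumptions here, not the paper's tuned values.

import torch
import torch.nn.functional as F

def clip_loss(img, txt, t=0.07):
    # Symmetric InfoNCE between matched image/text embeddings, both (B, D).
    logits = F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).T / t
    labels = torch.arange(img.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

def simclr_loss(z1, z2, t=0.1):
    # InfoNCE between two augmented views of the same images, each (B, D).
    B = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=-1)
    logits = (z @ z.T / t).masked_fill(torch.eye(2 * B, dtype=torch.bool), float('-inf'))
    labels = torch.cat([torch.arange(B) + B, torch.arange(B)])  # each view's positive
    return F.cross_entropy(logits, labels)

def slip_loss(img, txt, z1, z2, scale=1.0):  # scale is an assumed weighting
    return clip_loss(img, txt) + scale * simclr_loss(z1, z2)

print(slip_loss(torch.randn(4, 64), torch.randn(4, 64),
                torch.randn(4, 64), torch.randn(4, 64)))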

Context autoencoder for self-supervised representation learning

X Chen, M Ding, X Wang, Y Xin, S Mo, Y Wang… - International Journal of …, 2024 - Springer
We present a novel masked image modeling (MIM) approach, context autoencoder (CAE),
for self-supervised representation pretraining. We pretrain an encoder by making predictions …
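The distinctive move is that "making predictions" happens in latent space: visible patches are encoded, a regressor predicts the latents of the masked patches, those predictions are aligned with latents computed from the full image, and a decoder reconstructs the masked content from the predicted latents. The toy below collapses the context to a mean vector where the paper uses a cross-attention regressor, and every module and size is a placeholder.

import torch
import torch.nn as nn
import torch.nn.functional as F

D, P = 64, 16 * 16 * 3                 # latent dim, pixels per 16x16 RGB patch
encoder = nn.Linear(P, D)              # stand-in for the ViT encoder
regressor = nn.Linear(D, D)            # predicts masked latents from context
decoder = nn.Linear(D, P)              # reconstructs pixels from latents

def cae_step(patches, mask):
    # patches: (B, N, P) raw patch pixels; mask: (B, N) bool, True = masked.
    vis = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    z_vis = encoder(vis).masked_fill(mask.unsqueeze(-1), 0.0)  # visible latents only
    ctx = z_vis.sum(1, keepdim=True) / (~mask).sum(1, keepdim=True).clamp(min=1).unsqueeze(-1)
    z_pred = regressor(ctx).expand(-1, patches.size(1), -1)[mask]  # masked latents
    with torch.no_grad():
        z_tgt = encoder(patches)[mask]                   # latent targets, full view
    align = F.mse_loss(z_pred, z_tgt)                    # predict in latent space...
    recon = F.mse_loss(decoder(z_pred), patches[mask])   # ...then decode to pixels
    return align + recon

print(cae_step(torch.randn(2, 16, P), torch.rand(2, 16) < 0.5))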

BEiT v2: Masked image modeling with vector-quantized visual tokenizers

Z Peng, L Dong, H Bao, Q Ye, F Wei - arXiv preprint arXiv:2208.06366, 2022 - arxiv.org
Masked image modeling (MIM) has demonstrated impressive results in self-supervised
representation learning by recovering corrupted image patches. However, most existing …
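In this family of methods the reconstruction target is discrete: a visual tokenizer assigns each patch an id in a codebook, and the backbone is trained with cross-entropy to recover the ids of the corrupted patches. The sketch below freezes a random codebook and uses linear stand-ins for the tokenizer and the ViT; BEiT v2's actual contribution, learning the tokenizer by vector-quantizing semantically rich features, is not reproduced here.

import torch
import torch.nn as nn
import torch.nn.functional as F

V, D, P = 8192, 64, 16 * 16 * 3        # codebook size, latent dim, pixels per patch
codebook = F.normalize(torch.randn(V, D), dim=-1)  # frozen VQ codebook (illustrative)
tokenizer = nn.Linear(P, D)            # stand-in for the visual tokenizer encoder
backbone = nn.Linear(P, V)             # stand-in for the ViT plus prediction head

def mim_loss(patches, mask):
    # patches: (B, N, P) raw patches; mask: (B, N) bool, True = masked (corrupted).
    with torch.no_grad():              # the tokenizer is frozen during MIM training
        z = F.normalize(tokenizer(patches), dim=-1)
        ids = (z @ codebook.T).argmax(-1)              # nearest-code assignment
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    logits = backbone(corrupted)                       # (B, N, V) per-patch logits
    return F.cross_entropy(logits[mask], ids[mask])    # recover ids of hidden patches

print(mim_loss(torch.randn(2, 16, P), torch.rand(2, 16) < 0.4))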