Transformers in remote sensing: A survey

AA Aleissaee, A Kumar, RM Anwer, S Khan… - Remote Sensing, 2023‏ - mdpi.com
Deep learning-based algorithms have seen a massive popularity in different areas of remote
sensing image analysis over the past decade. Recently, transformer-based architectures …

[HTML][HTML] Deep learning based computer vision approaches for smart agricultural applications

VG Dhanya, A Subeesh, NL Kushwaha… - Artificial Intelligence in …, 2022‏ - Elsevier
The agriculture industry is undergoing a rapid digital transformation and is growing powerful
by the pillars of cutting-edge approaches like artificial intelligence and allied technologies …

Image as a foreign language: Beit pretraining for vision and vision-language tasks

W Wang, H Bao, L Dong, J Bjorck… - Proceedings of the …, 2023‏ - openaccess.thecvf.com
A big convergence of language, vision, and multimodal pretraining is emerging. In this work,
we introduce a general-purpose multimodal foundation model BEiT-3, which achieves …

Open-vocabulary panoptic segmentation with text-to-image diffusion models

J Xu, S Liu, A Vahdat, W Byeon… - Proceedings of the …, 2023‏ - openaccess.thecvf.com
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies
pre-trained text-image diffusion and discriminative models to perform open-vocabulary …

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024‏ - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Videomae v2: Scaling video masked autoencoders with dual masking

L Wang, B Huang, Z Zhao, Z Tong… - Proceedings of the …, 2023‏ - openaccess.thecvf.com
Scale is the primary factor for building a powerful foundation model that could well
generalize to a variety of downstream tasks. However, it is still challenging to train video …

Videomamba: State space model for efficient video understanding

K Li, X Li, Y Wang, Y He, Y Wang, L Wang… - European Conference on …, 2024‏ - Springer
Addressing the dual challenges of local redundancy and global dependencies in video
understanding, this work innovatively adapts the Mamba to the video domain. The proposed …

Eva: Exploring the limits of masked visual representation learning at scale

Y Fang, W Wang, B **e, Q Sun, L Wu… - Proceedings of the …, 2023‏ - openaccess.thecvf.com
We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …

Eva-02: A visual representation for neon genesis

Y Fang, Q Sun, X Wang, T Huang, X Wang… - Image and Vision …, 2024‏ - Elsevier
We launch EVA-02, a next-generation Transformer-based visual representation pre-trained
to reconstruct strong and robust language-aligned vision features via masked image …

Masked autoencoders as spatiotemporal learners

C Feichtenhofer, Y Li, K He - Advances in neural …, 2022‏ - proceedings.neurips.cc
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to
spatiotemporal representation learning from videos. We randomly mask out spacetime …