Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

OmniVec: Learning robust representations with cross modal sharing

S Srivastava, G Sharma - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
The majority of research in learning-based methods has been directed towards designing and
training networks for specific tasks. However, many of the learning-based tasks, across modalities …

Contrastive audio-visual masked autoencoder

Y Gong, A Rouditchenko, AH Liu, D Harwath… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single
modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio …
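The snippet above describes extending masked auto-encoding to paired audio-visual inputs with a contrastive objective. As a rough illustration only (a generic NumPy sketch of MAE-style random token masking plus a symmetric InfoNCE loss, not the authors' architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(tokens, mask_ratio, rng):
    """Keep a random subset of tokens, MAE-style.

    tokens: (num_tokens, dim). Returns (kept_tokens, kept_indices)."""
    num_keep = int(tokens.shape[0] * (1.0 - mask_ratio))
    idx = rng.permutation(tokens.shape[0])[:num_keep]
    return tokens[idx], idx

def info_nce(audio_emb, video_emb, temp=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/video embeddings."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temp                       # (batch, batch) cosine similarities
    labels = np.arange(len(a))
    log_sm = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_av = -log_sm[labels, labels].mean()      # audio -> video direction
    log_sm_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_va = -log_sm_t[labels, labels].mean()    # video -> audio direction
    return (loss_av + loss_va) / 2

# Toy example: 4 paired clips, 16 tokens per modality, 8-dim features.
audio_tokens = rng.normal(size=(4, 16, 8))
video_tokens = rng.normal(size=(4, 16, 8))
audio_emb = np.stack([random_mask(t, 0.75, rng)[0].mean(axis=0) for t in audio_tokens])
video_emb = np.stack([random_mask(t, 0.75, rng)[0].mean(axis=0) for t in video_tokens])
print(round(info_nce(audio_emb, video_emb), 3))
```

Correctly paired embeddings drive the diagonal of the similarity matrix up, so the loss falls as the two modalities' representations align.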

AdaMV-MoE: Adaptive multi-task vision mixture-of-experts

T Chen, X Chen, X Du, A Rashwan… - Proceedings of the …, 2023 - openaccess.thecvf.com
Sparsely activated Mixture-of-Experts (MoE) is becoming a promising paradigm for
multi-task learning (MTL). Instead of compressing multiple tasks' knowledge into a single …
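Conceptually, a sparse MoE layer routes each token to its top-k experts and mixes their outputs with renormalized gate scores, so only k experts run per token. A minimal generic top-k router in NumPy (an illustration of the paradigm, not the paper's adaptive expert-selection scheme):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, gate_w, expert_ws, top_k=2):
    """Sparsely activated MoE: each token uses only its top-k experts.

    x: (tokens, dim); gate_w: (dim, num_experts); expert_ws: list of (dim, dim)."""
    scores = softmax(x @ gate_w)                  # (tokens, num_experts) gate probabilities
    top = np.argsort(scores, axis=1)[:, -top_k:]  # indices of top-k experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        picked = scores[t, top[t]]
        weights = picked / picked.sum()           # renormalize over the selected experts
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ expert_ws[e])   # only k expert matmuls per token
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
gate_w = rng.normal(size=(8, 4))
experts = [rng.normal(size=(8, 8)) for _ in range(4)]
y = moe_layer(x, gate_w, experts, top_k=2)
print(y.shape)  # (5, 8)
```

The compute cost scales with k rather than with the total number of experts, which is what makes the paradigm attractive for packing many tasks into one model.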

WebVoyager: Building an end-to-end web agent with large multimodal models

H He, W Yao, K Ma, W Yu, Y Dai, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of large language models (LLMs) has led to a new era marked by
the development of autonomous applications in real-world scenarios, which drives …

OmniVec2: A novel transformer-based network for large-scale multimodal and multitask learning

S Srivastava, G Sharma - … of the IEEE/CVF conference on …, 2024 - openaccess.thecvf.com
We present a novel multimodal multitask network and associated training algorithm. The
method is capable of ingesting data from approximately 12 different modalities namely …

Multimodal distillation for egocentric action recognition

G Radevski, D Grujicic, M Blaschko… - Proceedings of the …, 2023 - openaccess.thecvf.com
The focal point of egocentric video understanding is modelling hand-object interactions.
Standard models, e.g., CNNs or Vision Transformers, which receive RGB frames as input …
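Multimodal distillation of this kind typically trains an RGB-only student to match the softened predictions of a teacher that sees extra modalities. A generic knowledge-distillation loss sketch (standard temperature-scaled KL divergence, not this paper's specific recipe):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student predictions.

    The temperature**2 factor keeps gradient magnitudes comparable across temperatures."""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    return float((t * (np.log(t) - np.log(s))).sum(axis=-1).mean() * temperature**2)

# Toy example: one sample, 3 action classes.
teacher = np.array([[2.0, 0.5, -1.0]])   # e.g. from a multimodal teacher
student = np.array([[1.8, 0.7, -0.9]])   # from an RGB-only student
print(round(distillation_loss(student, teacher), 4))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge.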

Sparse MoE as the new dropout: Scaling dense and self-slimmable transformers

T Chen, Z Zhang, A Jaiswal, S Liu, Z Wang - arXiv preprint arXiv …, 2023 - arxiv.org
Despite their remarkable achievement, gigantic transformers encounter significant
drawbacks, including exorbitant computational and memory footprints during training, as …

Versatile audio-visual learning for handling single and multi modalities in emotion regression and classification tasks

L Goncalves, SG Leem, WC Lin, B Sisman… - arXiv preprint arXiv …, 2023 - ecs.utdallas.edu
Most current audio-visual emotion recognition models lack the flexibility needed for
deployment in practical applications. We envision a multimodal system that works even …