Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

Localizing objects with self-supervised transformers and no labels

O Siméoni, G Puy, HV Vo, S Roburin, S Gidaris… - arXiv preprint arXiv …, 2021 - arxiv.org
Localizing objects in image collections without supervision can help to avoid expensive
annotation campaigns. We propose a simple approach to this problem that leverages the …

POP-3D: Open-vocabulary 3D occupancy prediction from images

A Vobecky, O Siméoni, D Hurych… - Advances in …, 2023 - proceedings.neurips.cc
We describe an approach to predict an open-vocabulary 3D semantic voxel occupancy map
from input 2D images with the objective of enabling 3D grounding, segmentation and …

Audio-visual generalised zero-shot learning with cross-modal attention and language

OB Mercea, L Riesch, A Koepke… - Proceedings of the …, 2022 - openaccess.thecvf.com
Learning to classify video data from classes not included in the training data, i.e., video-based
zero-shot learning, is challenging. We conjecture that the natural alignment between the …

Multi-modal data clustering using deep learning: A systematic review

S Raya, M Orabi, I Afyouni, Z Al Aghbari - Neurocomputing, 2024 - Elsevier
Multi-modal clustering represents a formidable challenge in the domain of unsupervised
learning. The objective of multi-modal clustering is to categorize data collected from diverse …

Object-aware contrastive learning for debiased scene representation

S Mo, H Kang, K Sohn, CL Li… - Advances in Neural …, 2021 - proceedings.neurips.cc
Contrastive self-supervised learning has shown impressive results in learning visual
representations from unlabeled images by enforcing invariance against different data …

EclipSE: Efficient Long-Range Video Retrieval Using Sight and Sound

YB Lin, J Lei, M Bansal, G Bertasius - European Conference on Computer …, 2022 - Springer
We introduce an audiovisual method for long-range text-to-video retrieval. Unlike previous
approaches designed for short video retrieval (e.g., 5–15 s in duration), our approach aims to …

Temporal and cross-modal attention for audio-visual zero-shot learning

OB Mercea, T Hummel, AS Koepke, Z Akata - European Conference on …, 2022 - Springer
Audio-visual generalised zero-shot learning for video classification requires understanding
the relations between the audio and visual information in order to be able to recognise …

Sound localization from motion: Jointly learning sound direction and camera rotation

Z Chen, S Qian, A Owens - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
The images and sounds that we perceive undergo subtle but geometrically consistent
changes as we rotate our heads. In this paper, we use these cues to solve a problem we call …

Drive&Segment: Unsupervised semantic segmentation of urban scenes via cross-modal distillation

A Vobecky, D Hurych, O Siméoni, S Gidaris… - … on Computer Vision, 2022 - Springer
This work investigates learning pixel-wise semantic image segmentation in urban scenes
without any manual annotation, using only the raw non-curated data collected by cars which …