Self-supervised multimodal learning: A survey
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …
Localizing objects with self-supervised transformers and no labels
Localizing objects in image collections without supervision can help to avoid expensive
annotation campaigns. We propose a simple approach to this problem that leverages the …
POP-3D: Open-vocabulary 3D occupancy prediction from images
We describe an approach to predict an open-vocabulary 3D semantic voxel occupancy map
from input 2D images with the objective of enabling 3D grounding, segmentation and …
Audio-visual generalised zero-shot learning with cross-modal attention and language
Learning to classify video data from classes not included in the training data, i.e., video-based
zero-shot learning, is challenging. We conjecture that the natural alignment between the …
Multi-modal data clustering using deep learning: A systematic review
Multi-modal clustering represents a formidable challenge in the domain of unsupervised
learning. The objective of multi-modal clustering is to categorize data collected from diverse …
Object-aware contrastive learning for debiased scene representation
Contrastive self-supervised learning has shown impressive results in learning visual
representations from unlabeled images by enforcing invariance against different data …
EclipSE: Efficient Long-Range Video Retrieval Using Sight and Sound
We introduce an audiovisual method for long-range text-to-video retrieval. Unlike previous
approaches designed for short video retrieval (e.g., 5–15 s in duration), our approach aims to …
Temporal and cross-modal attention for audio-visual zero-shot learning
Audio-visual generalised zero-shot learning for video classification requires understanding
the relations between the audio and visual information in order to recognise …
Sound localization from motion: Jointly learning sound direction and camera rotation
The images and sounds that we perceive undergo subtle but geometrically consistent
changes as we rotate our heads. In this paper, we use these cues to solve a problem we call …
Drive&Segment: Unsupervised semantic segmentation of urban scenes via cross-modal distillation
This work investigates learning pixel-wise semantic image segmentation in urban scenes
without any manual annotation, just from raw, non-curated data collected by cars which …