Contrastive self-supervised learning: review, progress, challenges and future research directions

P Kumar, P Rawat, S Chauhan - International Journal of Multimedia …, 2022 - Springer
In the last decade, deep supervised learning has had tremendous success. However, its
flaws, such as its dependency on manual and costly annotations on large datasets and …

Curriculum learning: A survey

P Soviany, RT Ionescu, P Rota, N Sebe - International Journal of …, 2022 - Springer
Training machine learning models in a meaningful order, from the easy samples to the hard
ones, using curriculum learning can provide performance improvements over the standard …

Self-supervised learning of audio-visual objects from video

T Afouras, A Owens, JS Chung, A Zisserman - Computer Vision–ECCV …, 2020 - Springer
Our objective is to transform a video into a set of discrete audio-visual objects using self-
supervised learning. To this end, we introduce a model that uses attention to localize and …

Multiple sound sources localization from coarse to fine

R Qian, D Hu, H Dinkel, M Wu, N Xu, W Lin - Computer Vision–ECCV …, 2020 - Springer
How to visually localize multiple sound sources in unconstrained videos is a formidable
problem, especially when lack of the pairwise sound-object annotations. To solve this …

Discriminative sounding objects localization via self-supervised audiovisual matching

D Hu, R Qian, M Jiang, X Tan, S Wen… - Advances in …, 2020 - proceedings.neurips.cc
Discriminatively localizing sounding objects in cocktail-party, i.e., mixed sound scenes, is
commonplace for humans, but still challenging for machines. In this paper, we propose a two …

Cyclic co-learning of sounding object visual grounding and sound separation

Y Tian, D Hu, C Xu - … of the IEEE/CVF Conference on …, 2021 - openaccess.thecvf.com
There are rich synchronized audio and visual events in our daily life. Inside the events,
audio scenes are associated with the corresponding visual objects; meanwhile, sounding …

Increasing Importance of Joint Analysis of Audio and Video in Computer Vision: A Survey

A Shahabaz, S Sarkar - IEEE Access, 2024 - ieeexplore.ieee.org
The joint analysis of audio and video is a powerful tool that can be applied to various
contexts, including action, speech, and sound recognition, audio-visual video parsing …

Into the wild with AudioScope: Unsupervised audio-visual separation of on-screen sounds

E Tzinis, S Wisdom, A Jansen, S Hershey… - arXiv preprint arXiv …, 2020 - arxiv.org
Recent progress in deep learning has enabled many advances in sound separation and
visual scene understanding. However, extracting sound sources which are apparent in …

Self-supervised object detection from audio-visual correspondence

T Afouras, YM Asano, F Fagan… - Proceedings of the …, 2022 - openaccess.thecvf.com
We tackle the problem of learning object detectors without supervision. Differently from
weakly-supervised object detection, we do not assume image-level class labels. Instead, we …

Self-supervised predictive learning: A negative-free method for sound source localization in visual scenes

Z Song, Y Wang, J Fan, T Tan, Z Zhang - arXiv preprint arXiv:2203.13412, 2022 - arxiv.org
Sound source localization in visual scenes aims to localize objects emitting the sound in a
given image. Recent works showing impressive localization performance typically rely on …