Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arxiv preprint arxiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Causal reasoning meets visual representation learning: A prospective study

Y Liu, YS Wei, H Yan, GB Li, L Lin - Machine Intelligence Research, 2022 - Springer
Visual representation learning is ubiquitous in various real-world applications, including
visual comprehension, video understanding, multi-modal analysis, human-computer …

Semi-supervised and unsupervised deep visual learning: A survey

Y Chen, M Mancini, X Zhu… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
State-of-the-art deep learning models are often trained with a large amount of costly labeled
training data. However, requiring exhaustive manual annotations may degrade the model's …

Avoid-df: Audio-visual joint learning for detecting deepfake

W Yang, X Zhou, Z Chen, B Guo, Z Ba… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Recently, deepfakes have raised severe concerns about the authenticity of online media.
Prior works for deepfake detection have made many efforts to capture the intra-modal …

Sound to visual scene generation by audio-to-visual latent alignment

K Sung-Bin, A Senocak, H Ha… - Proceedings of the …, 2023 - openaccess.thecvf.com
How does audio describe the world around us? In this paper, we propose a method for
generating an image of a scene from sound. Our method addresses the challenges of …

Audio-visual generalised zero-shot learning with cross-modal attention and language

OB Mercea, L Riesch, A Koepke… - Proceedings of the …, 2022 - openaccess.thecvf.com
Learning to classify video data from classes not included in the training data, ie video-based
zero-shot learning, is challenging. We conjecture that the natural alignment between the …

Sound-guided semantic image manipulation

SH Lee, W Roh, W Byeon, SH Yoon… - Proceedings of the …, 2022 - openaccess.thecvf.com
The recent success of the generative model shows that leveraging the multi-modal
embedding space can manipulate an image using text information. However, manipulating …

Integrating language guidance into vision-based deep metric learning

K Roth, O Vinyals, Z Akata - Proceedings of the IEEE/CVF …, 2022 - openaccess.thecvf.com
Abstract Deep Metric Learning (DML) proposes to learn metric spaces which encode
semantic similarities as embedding space distances. These spaces should be transferable …

Self-supervised predictive learning: A negative-free method for sound source localization in visual scenes

Z Song, Y Wang, J Fan, T Tan, Z Zhang - arxiv preprint arxiv:2203.13412, 2022 - arxiv.org
Sound source localization in visual scenes aims to localize objects emitting the sound in a
given image. Recent works showing impressive localization performance typically rely on …

Modality-aware contrastive instance learning with self-distillation for weakly-supervised audio-visual violence detection

J Yu, J Liu, Y Cheng, R Feng, Y Zhang - Proceedings of the 30th ACM …, 2022 - dl.acm.org
Weakly-supervised audio-visual violence detection aims to distinguish snippets containing
multimodal violence events with video-level labels. Many prior works perform audio-visual …