A deep cross-modality hashing network for SAR and optical remote sensing images retrieval

W Xiong, Z Xiong, Y Zhang, Y Cui… - IEEE Journal of Selected …, 2020 - ieeexplore.ieee.org
The content-based remote sensing image retrieval (CBRSIR) has recently become a hot
topic due to its wide applications in analysis of remote sensing data. However, since …

Conditioned source separation for musical instrument performances

O Slizovskaia, G Haro, E Gómez - IEEE/ACM Transactions on …, 2021 - ieeexplore.ieee.org
In music source separation, the number of sources may vary for each piece and some of the
sources may belong to the same family of instruments, thus sharing timbral characteristics …

Less can be more: Sound source localization with a classification model

A Senocak, H Ryu, J Kim… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
In this paper, we tackle sound localization as a natural outcome of the audio-visual video
classification problem. Differently from the existing sound localization approaches, we do not …

Large scale audiovisual learning of sounds with weakly labeled data

HM Fayek, A Kumar - arXiv preprint arXiv:2006.01595, 2020 - arxiv.org
Recognizing sounds is a key aspect of computational audio scene analysis and machine
perception. In this paper, we advocate that sound recognition is inherently a multi-modal …

Cross-modal music-video recommendation: A study of design choices

L Prétet, G Richard, G Peeters - 2021 International Joint …, 2021 - ieeexplore.ieee.org
In this work, we study music/video cross-modal recommendation, i.e., recommending a music
track for a video or vice versa. We rely on a self-supervised learning paradigm to learn from …

SSLNet: A network for cross-modal sound source localization in visual scenes

F Feng, Y Ming, N Hu - Neurocomputing, 2022 - Elsevier
Sound source localization in visual scenes is to associate sounds and their visual
producers. Although great progress has been made in this field, the mixed sounds from …

TriBERT: Full-body human-centric audio-visual representation learning for visual sound separation

T Rahman, M Yang, L Sigal - arXiv preprint arXiv:2110.13412, 2021 - arxiv.org
The recent success of transformer models in language, such as BERT, has motivated the
use of such architectures for multi-modal feature learning and tasks. However, most multi …

Unsupervised synthetic acoustic image generation for audio-visual scene understanding

V Sanguineti, P Morerio, A Del Bue… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Acoustic images are an emergent data modality for multimodal scene understanding. Such
images have the peculiarity of distinguishing the spectral signature of the sound coming …

Multimodal Alignment and Fusion: A Survey

S Li, H Tang - arXiv preprint arXiv:2411.17040, 2024 - arxiv.org
This survey offers a comprehensive review of recent advancements in multimodal alignment
and fusion within machine learning, spurred by the growing diversity of data types such as …

TriBERT: Human-centric audio-visual representation learning

T Rahman, M Yang, L Sigal - Advances in Neural …, 2021 - proceedings.neurips.cc
The recent success of transformer models in language, such as BERT, has motivated the
use of such architectures for multi-modal feature learning and tasks. However, most multi …