Self-supervised speech representation learning: A review
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …
Audio self-supervised learning: A survey
Similar to humans' cognitive ability to generalize knowledge and skills, self-supervised
learning (SSL) targets discovering general representations from large-scale data. This …
Attention bottlenecks for multimodal fusion
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …
Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation
We present a joint audio-visual model for isolating a single speech signal from a mixture of
sounds such as other speakers and background noise. Solving this task using only audio as …
Look, listen and learn
We consider the question: what can be learnt by looking at and listening to a large number
of unlabelled videos? There is a valuable, but so far untapped, source of information …
Objects that sound
In this paper our objectives are, first, networks that can embed audio and visual inputs into a
common space that is suitable for cross-modal retrieval; and second, a network that can …
Audio-visual event localization in unconstrained videos
In this paper, we introduce a novel problem of audio-visual event localization in
unconstrained videos. We define an audio-visual event as an event that is both visible and …
Dual attention matching for audio-visual event localization
In this paper, we investigate the audio-visual event localization problem. This task is to
localize a visible and audible event in a video. Previous methods first divide a video into …
Learning representations from audio-visual spatial alignment
We introduce a novel self-supervised pretext task for learning representations from audio-
visual content. Prior work on audio-visual representation learning leverages …
TVLT: Textless vision-language transformer
In this work, we present the Textless Vision-Language Transformer (TVLT), where
homogeneous transformer blocks take raw visual and audio inputs for vision-and-language …