- Academic Search

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier

The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Lagre Referanse Sitert av 242 Beslektede artikler Alle 7 versjoner

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org

Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Lagre Referanse Sitert av 409 Beslektede artikler Alle 10 versjoner

[Free GPT-4]
[DeepSeek]

[PDF] thecvf.com

Ego4d: Around the world in 3,000 hours of egocentric video

K Grauman, A Westbury, E Byrne… - Proceedings of the …, 2022 - openaccess.thecvf.com

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …

Lagre Referanse Sitert av 1012 Beslektede artikler Alle 20 versjoner HTML-versjon

[Free GPT-4]
[DeepSeek]

[PDF] thecvf.com

Diffused heads: Diffusion models beat gans on talking-face generation

M Stypułkowski, K Vougioukas, S He… - Proceedings of the …, 2024 - openaccess.thecvf.com

Talking face generation has historically struggled to produce head movements and natural
facial expressions without guidance from additional reference videos. Recent developments …

Lagre Referanse Sitert av 138 Beslektede artikler Alle 6 versjoner HTML-versjon

[Free GPT-4]
[DeepSeek]

[PDF] neurips.cc

Attention bottlenecks for multimodal fusion

A Nagrani, S Yang, A Arnab, A Jansen… - Advances in neural …, 2021 - proceedings.neurips.cc

Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …

Lagre Referanse Sitert av 646 Beslektede artikler Alle 7 versjoner HTML-versjon

[Free GPT-4]
[DeepSeek]

[PDF] neurips.cc

What makes multi-modal learning better than single (provably)

Y Huang, C Du, Z Xue, X Chen… - Advances in Neural …, 2021 - proceedings.neurips.cc

The world provides us with data of multiple modalities. Intuitively, models fusing data from
different modalities outperform their uni-modal counterparts, since more information is …

Lagre Referanse Sitert av 303 Beslektede artikler Alle 8 versjoner HTML-versjon

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

CelebV-HQ: A large-scale video facial attributes dataset

H Zhu, W Wu, W Zhu, L Jiang, S Tang, L Zhang… - European conference on …, 2022 - Springer

Large-scale datasets have played indispensable roles in the recent success of face
generation/editing and significantly facilitated the advances of emerging research fields …

Lagre Referanse Sitert av 131 Beslektede artikler Alle 7 versjoner

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Neural target speech extraction: An overview

K Zmolikova, M Delcroix, T Ochiai… - IEEE Signal …, 2023 - ieeexplore.ieee.org

Humans can listen to a target speaker even in challenging acoustic conditions that have
noise, reverberation, and interfering speakers. This phenomenon is known as the cocktail …

Lagre Referanse Sitert av 90 Beslektede artikler Alle 5 versjoner

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Visual speech recognition for multiple languages in the wild

P Ma, S Petridis, M Pantic - Nature Machine Intelligence, 2022 - nature.com

Visual speech recognition (VSR) aims to recognize the content of speech based on lip
movements, without relying on the audio stream. Advances in deep learning and the …

Lagre Referanse Sitert av 145 Beslektede artikler Alle 7 versjoner

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Auto-avsr: Audio-visual speech recognition with automatic labels

P Ma, A Haliassos, A Fernandez-Lopez… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org

Audio-visual speech recognition has received a lot of attention due to its robustness against
acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech …

Lagre Referanse Sitert av 118 Beslektede artikler Alle 3 versjoner

Opprett varsel

Referanse

Avansert søk

Lagret i Mitt bibliotek

Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech...

A review of deep learning techniques for speech processing

Self-supervised speech representation learning: A review

Ego4d: Around the world in 3,000 hours of egocentric video

Diffused heads: Diffusion models beat gans on talking-face generation

Attention bottlenecks for multimodal fusion

What makes multi-modal learning better than single (provably)

CelebV-HQ: A large-scale video facial attributes dataset

Neural target speech extraction: An overview

Visual speech recognition for multiple languages in the wild

Auto-avsr: Audio-visual speech recognition with automatic labels