A review of deep learning techniques for speech processing
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …
Speaker recognition based on deep learning: An overview
Speaker recognition is the task of identifying persons from their voices. Recently, deep
learning has dramatically revolutionized speaker recognition. However, there is a lack of …
Ego4d: Around the world in 3,000 hours of egocentric video
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …
YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone
E Casanova, J Weber, CD Shulby… - International …, 2022 - proceedings.mlr.press
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker
TTS. Our method builds upon the VITS model and adds several novel modifications for zero …
Pose-controllable talking face generation by implicitly modularized audio-visual representation
While accurate lip synchronization has been achieved for arbitrary-subject audio-driven
talking face generation, the problem of how to efficiently drive the head pose remains …
Expressive talking head generation with granular audio-visual control
Generating expressive talking heads is essential for creating virtual humans. However,
existing one- or few-shot methods focus on lip-sync and head motion, ignoring the emotional …
MFA-Conformer: Multi-scale feature aggregation conformer for automatic speaker verification
In this paper, we present Multi-scale Feature Aggregation Conformer (MFA-Conformer), an
easy-to-implement, simple but effective backbone for automatic speaker verification based …
Is someone speaking? Exploring long-term temporal features for audio-visual active speaker detection
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or
more speakers. Successful ASD depends on accurate interpretation of short-term and …
Learning hierarchical cross-modal association for co-speech gesture generation
Generating speech-consistent body and gesture movements is a long-standing problem in
virtual avatar creation. Previous studies often synthesize pose movement in a holistic …
AISHELL-3: A multi-speaker Mandarin TTS corpus and the baselines
In this paper, we present AISHELL-3, a large-scale and high-fidelity multi-speaker Mandarin
speech corpus which could be used to train multi-speaker Text-to-Speech (TTS) systems …