- Academic Search

A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier

The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Cited by 242

Self-supervised learning for videos: A survey

MC Schiappa, YS Rawat, M Shah - ACM Computing Surveys, 2023 - dl.acm.org

The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …

Cited by 155

VASA-1: Lifelike audio-driven talking faces generated in real time

S Xu, G Chen, YX Guo, J Yang, C Li… - Advances in …, 2025 - proceedings.neurips.cc

We introduce VASA, a framework for generating lifelike talking faces with appealing visual
affective skills (VAS) given a single static image and a speech audio clip. Our premiere …

Cited by 62

WavLM: Large-scale self-supervised pre-training for full stack speech processing

S Chen, C Wang, Z Chen, Y Wu, S Liu… - IEEE Journal of …, 2022 - ieeexplore.ieee.org

Self-supervised learning (SSL) achieves great success in speech recognition, while limited
exploration has been attempted for other speech processing tasks. As speech signal …

Cited by 1862

Ego4D: Around the world in 3,000 hours of egocentric video

K Grauman, A Westbury, E Byrne… - Proceedings of the …, 2022 - openaccess.thecvf.com

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …

Cited by 1009

YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone

E Casanova, J Weber, CD Shulby… - International …, 2022 - proceedings.mlr.press

YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker
TTS. Our method builds upon the VITS model and adds several novel modifications for zero …

Cited by 447

SpeechBrain: A general-purpose speech toolkit

M Ravanelli, T Parcollet, P Plantinga, A Rouhe… - arXiv preprint arXiv …, 2021 - arxiv.org

SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the
research and development of neural speech processing technologies by being simple …

Cited by 793

EMOCA: Emotion driven monocular face capture and animation

R Daněček, MJ Black, T Bolkart - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com

As 3D facial avatars become more widely used for communication, it is critical that they
faithfully convey emotion. Unfortunately, the best recent methods that regress parametric 3D …

Cited by 205

ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection

J Yamagishi, X Wang, M Todisco, M Sahidullah… - arXiv preprint arXiv …, 2021 - arxiv.org

ASVspoof 2021 is the fourth edition in the series of bi-annual challenges which aim to
promote the study of spoofing and the design of countermeasures to protect automatic …

Cited by 403

Learning audio-visual speech representation by masked multimodal cluster prediction

B Shi, WN Hsu, K Lakhotia, A Mohamed - arXiv preprint arXiv:2201.02184, 2022 - arxiv.org

Video recordings of speech contain correlated audio and visual information, providing a
strong signal for speech representation learning from the speaker's lip movements and the …

Cited by 331

Voxceleb2: Deep speaker recognition
