A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Robust speech recognition via large-scale weak supervision

A Radford, JW Kim, T Xu, G Brockman… - International …, 2023 - proceedings.mlr.press
We study the capabilities of speech processing systems trained simply to predict large
amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual …

Audioldm 2: Learning holistic audio generation with self-supervised pretraining

H Liu, Y Yuan, X Liu, X Mei, Q Kong… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Although audio generation shares commonalities across different types of audio, such as
speech, music, and sound effects, designing models for each type requires careful …

Wavlm: Large-scale self-supervised pre-training for full stack speech processing

S Chen, C Wang, Z Chen, Y Wu, S Liu… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Self-supervised learning (SSL) achieves great success in speech recognition, while limited
exploration has been attempted for other speech processing tasks. As speech signal …

Ego4d: Around the world in 3,000 hours of egocentric video

K Grauman, A Westbury, E Byrne… - Proceedings of the …, 2022 - openaccess.thecvf.com
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …

Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition

B Zhang, H Lv, P Guo, Q Shao, C Yang… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of
10000+ hours high-quality labeled speech, 2400+ hours weakly labeled speech, and about …

The singing voice conversion challenge 2023

WC Huang, LP Violeta, S Liu, J Shi… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
We present the latest iteration of the voice conversion challenge (VCC) series, a bi-annual
scientific event aiming to compare and understand different voice conversion (VC) systems …

Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec

Z Du, S Zhang, K Hu, S Zheng - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an
extension of the open-source speech processing toolkit FunASR. FunCodec provides …

Connecting speech encoder and large language model for asr

W Yu, C Tang, G Sun, X Chen, T Tan… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
The impressive capability and versatility of large language models (LLMs) have aroused
increasing attention in automatic speech recognition (ASR), with several pioneering studies …