VoiceCraft: Zero-shot speech editing and text-to-speech in the wild

P Peng, PY Huang, SW Li, A Mohamed… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-
of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on …

Thank you for attention: a survey on attention-based artificial neural networks for automatic speech recognition

P Karmakar, SW Teng, G Lu - Intelligent Systems with Applications, 2024 - Elsevier
Attention is a very popular and effective mechanism in artificial neural network-based
sequence-to-sequence models. In this survey paper, a comprehensive review of the different …

Exploring the capability of mamba in speech applications

K Miyazaki, Y Masuyama, M Murata - arXiv preprint arXiv:2406.16808, 2024 - arxiv.org
This paper explores the capability of Mamba, a recently proposed architecture based on
state space models (SSMs), as a competitive alternative to Transformer-based models. In …

Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

W Kang, X Yang, Z Yao, F Kuang… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours
of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is …

Towards universal speech discrete tokens: A case study for ASR and TTS

Y Yang, F Shen, C Du, Z Ma, K Yu… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into
utilizing discrete tokens for speech tasks like recognition and translation, which offer lower …

Spontaneous style text-to-speech synthesis with controllable spontaneous behaviors based on language models

W Li, P Yang, Y Zhong, Y Zhou, Z Wang, Z Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Spontaneous style speech synthesis, which aims to generate human-like speech, often
encounters challenges due to the scarcity of high-quality data and limitations in model …

LibriheavyMix: a 20,000-hour dataset for single-channel reverberant multi-talker speech separation, ASR and speaker diarization

Z **, Y Yang, M Shi, W Kang, X Yang, Z Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
The evolving speech processing landscape is increasingly focused on complex scenarios
like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions …

VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech

C Du, Y Guo, H Wang, Y Yang, Z Niu, S Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent TTS models with decoder-only Transformer architecture, such as SPEAR-TTS and
VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot …

Convert and speak: Zero-shot accent conversion with minimum supervision

H Xue, X Peng, Y Lu - ACM Multimedia 2024, 2024 - openreview.net
Low resource of parallel data is the key challenge of accent conversion (AC) problem in
which both the pronunciation units and prosody pattern need to be converted. We propose a …

On Speaker Attribution with SURT

D Raj, M Wiesner, M Maciejewski… - arXiv preprint arXiv …, 2024 - arxiv.org
The Streaming Unmixing and Recognition Transducer (SURT) has recently become a
popular framework for continuous, streaming, multi-talker automatic speech recognition (ASR). With …