VoiceCraft: Zero-shot speech editing and text-to-speech in the wild
We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-
of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on …
Thank you for attention: a survey on attention-based artificial neural networks for automatic speech recognition
Attention is a very popular and effective mechanism in artificial neural network-based
sequence-to-sequence models. In this survey paper, a comprehensive review of the different …
Exploring the capability of Mamba in speech applications
This paper explores the capability of Mamba, a recently proposed architecture based on
state space models (SSMs), as a competitive alternative to Transformer-based models. In …
Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context
In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours
of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is …
Towards universal speech discrete tokens: A case study for ASR and TTS
Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into
utilizing discrete tokens for speech tasks like recognition and translation, which offer lower …
Spontaneous style text-to-speech synthesis with controllable spontaneous behaviors based on language models
Spontaneous style speech synthesis, which aims to generate human-like speech, often
encounters challenges due to the scarcity of high-quality data and limitations in model …
LibriheavyMix: a 20,000-hour dataset for single-channel reverberant multi-talker speech separation, ASR and speaker diarization
The evolving speech processing landscape is increasingly focused on complex scenarios
like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions …
VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech
Recent TTS models with decoder-only Transformer architecture, such as SPEAR-TTS and
VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot …
Convert and speak: Zero-shot accent conversion with minimum supervision
H Xue, X Peng, Y Lu - ACM Multimedia 2024, 2024 - openreview.net
Low resource of parallel data is the key challenge of accent conversion (AC) problem in
which both the pronunciation units and prosody pattern need to be converted. We propose a …
On Speaker Attribution with SURT
The Streaming Unmixing and Recognition Transducer (SURT) has recently become a
popular framework for continuous, streaming, multi-talker speech recognition (ASR). With …