VoiceCraft: Zero-shot speech editing and text-to-speech in the wild

P Peng, PY Huang, SW Li, A Mohamed… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-
of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on …

Thank you for attention: a survey on attention-based artificial neural networks for automatic speech recognition

P Karmakar, SW Teng, G Lu - Intelligent Systems with Applications, 2024 - Elsevier
Attention is a very popular and effective mechanism in artificial neural network-based
sequence-to-sequence models. In this survey paper, a comprehensive review of the different …

Exploring the capability of mamba in speech applications

K Miyazaki, Y Masuyama, M Murata - arXiv preprint arXiv:2406.16808, 2024 - arxiv.org
This paper explores the capability of Mamba, a recently proposed architecture based on
state space models (SSMs), as a competitive alternative to Transformer-based models. In …

Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

W Kang, X Yang, Z Yao, F Kuang… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours
of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is …

Towards universal speech discrete tokens: A case study for ASR and TTS

Y Yang, F Shen, C Du, Z Ma, K Yu… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into
utilizing discrete tokens for speech tasks like recognition and translation, which offer lower …

Spontaneous style text-to-speech synthesis with controllable spontaneous behaviors based on language models

W Li, P Yang, Y Zhong, Y Zhou, Z Wang, Z Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Spontaneous style speech synthesis, which aims to generate human-like speech, often
encounters challenges due to the scarcity of high-quality data and limitations in model …

LibriheavyMix: a 20,000-hour dataset for single-channel reverberant multi-talker speech separation, ASR and speaker diarization

Z **, Y Yang, M Shi, W Kang, X Yang, Z Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
The evolving speech processing landscape is increasingly focused on complex scenarios
like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions …

VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech

C Du, Y Guo, H Wang, Y Yang, Z Niu, S Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent TTS models with decoder-only Transformer architecture, such as SPEAR-TTS and
VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot …

Convert and speak: Zero-shot accent conversion with minimum supervision

H Xue, X Peng, Y Lu - ACM Multimedia 2024, 2024 - openreview.net
Low resource of parallel data is the key challenge of accent conversion (AC) problem in
which both the pronunciation units and prosody pattern need to be converted. We propose a …

On Speaker Attribution with SURT

D Raj, M Wiesner, M Maciejewski… - arXiv preprint arXiv …, 2024 - arxiv.org
The Streaming Unmixing and Recognition Transducer (SURT) has recently become a
popular framework for continuous, streaming, multi-talker automatic speech recognition (ASR). With …