Thank you for attention: A survey on attention-based artificial neural networks for automatic speech recognition

P Karmakar, SW Teng, G Lu - Intelligent Systems with Applications, 2024 - Elsevier
Attention is a very popular and effective mechanism in artificial neural network-based
sequence-to-sequence models. In this survey paper, a comprehensive review of the different …

Libriheavy: A 50,000 hours ASR corpus with punctuation casing and context

W Kang, X Yang, Z Yao, F Kuang… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours
of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is …

An embarrassingly simple approach for LLM with strong ASR capacity

Z Ma, G Yang, Y Yang, Z Gao, J Wang, Z Du… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we focus on solving one of the most important tasks in the field of speech
processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and …

Towards universal speech discrete tokens: A case study for ASR and TTS

Y Yang, F Shen, C Du, Z Ma, K Yu… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
The proficiency of self-supervised learning (SSL) in speech-related tasks has driven research into
utilizing discrete tokens for speech tasks like recognition and translation, which offer lower …

VALL-T: Decoder-only generative transducer for robust and decoding-controllable text-to-speech

C Du, Y Guo, H Wang, Y Yang, Z Niu, S Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent TTS models with decoder-only Transformer architecture, such as SPEAR-TTS and
VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot …

Exploring the capability of Mamba in speech applications

K Miyazaki, Y Masuyama, M Murata - arXiv preprint arXiv:2406.16808, 2024 - arxiv.org
This paper explores the capability of Mamba, a recently proposed architecture based on
state space models (SSMs), as a competitive alternative to Transformer-based models. In …

GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

Y Yang, Z Song, J Zhuo, M Cui, J Li, B Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
The evolution of speech technology has been spurred by the rapid increase in dataset sizes.
Traditional speech models generally depend on a large amount of labeled training data …

PromptASR for contextualized ASR with controllable style

X Yang, W Kang, Z Yao, Y Yang, L Guo… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Prompts are crucial to large language models as they provide context information such as
topic or logical relationships. Inspired by this, we propose PromptASR, a framework that …

Spontaneous style text-to-speech synthesis with controllable spontaneous behaviors based on language models

W Li, P Yang, Y Zhong, Y Zhou, Z Wang, Z Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Spontaneous style speech synthesis, which aims to generate human-like speech, often
encounters challenges due to the scarcity of high-quality data and limitations in model …

LibriheavyMix: A 20,000-hour dataset for single-channel reverberant multi-talker speech separation, ASR and speaker diarization

Z Jin, Y Yang, M Shi, W Kang, X Yang, Z Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
The evolving speech processing landscape is increasingly focused on complex scenarios
like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions …