Transformers in speech processing: A survey

S Latif, A Zaidi, H Cuayahuitl, F Shamshad… - arXiv preprint arXiv …, 2023 - arxiv.org
The remarkable success of transformers in the field of natural language processing has
sparked the interest of the speech-processing community, leading to an exploration of their …

A transformer-based model with self-distillation for multimodal emotion recognition in conversations

H Ma, J Wang, H Lin, B Zhang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Emotion recognition in conversations (ERC), the task of recognizing the emotion of each
utterance in a conversation, is crucial for building empathetic machines. Existing studies …

End-to-end Japanese multi-dialect speech recognition and dialect identification with multi-task learning

R Imaizumi, R Masumura, S Shiota… - … Transactions on Signal …, 2022 - nowpublishers.com
End-to-end systems have demonstrated state-of-the-art performance on many tasks related
to automatic speech recognition (ASR) and dialect identification (DID). In this paper, we …

Self-Distillation Regularized Connectionist Temporal Classification Loss for Text Recognition: A Simple Yet Effective Approach

Z Zhang, N Lu, M Liao, Y Huang, C Li… - Proceedings of the …, 2024 - ojs.aaai.org
Text recognition methods are gaining rapid development. Some advanced techniques, e.g.,
powerful modules, language models, and un- and semi-supervised learning schemes …

Multi-encoder learning and stream fusion for transformer-based end-to-end automatic speech recognition

T Lohrenz, Z Li, T Fingscheidt - arXiv preprint arXiv:2104.00120, 2021 - arxiv.org
Stream fusion, also known as system combination, is a common technique in automatic
speech recognition for traditional hybrid hidden Markov model approaches, yet mostly …

Layer pruning on demand with intermediate CTC

J Lee, J Kang, S Watanabe - arXiv preprint arXiv:2106.09216, 2021 - arxiv.org
Deploying an end-to-end automatic speech recognition (ASR) model on mobile/embedded
devices is a challenging task, since the device computational power and energy …

Alignment knowledge distillation for online streaming attention-based speech recognition

H Inaguma, T Kawahara - IEEE/ACM Transactions on Audio …, 2021 - ieeexplore.ieee.org
This article describes an efficient training method for online streaming attention-based
encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have …

Relaxed attention: A simple method to boost performance of end-to-end automatic speech recognition

T Lohrenz, P Schwarz, Z Li… - 2021 IEEE Automatic …, 2021 - ieeexplore.ieee.org
Recently, attention-based encoder-decoder (AED) models have shown high performance for
end-to-end automatic speech recognition (ASR) across several tasks. Addressing …

A comparative study on neural architectures and training methods for Japanese speech recognition

S Karita, Y Kubo, MAU Bacchiani, L Jones - arXiv preprint arXiv …, 2021 - arxiv.org
End-to-end (E2E) modeling is advantageous for automatic speech recognition (ASR)
especially for Japanese since word-based tokenization of Japanese is not trivial, and E2E …

Distilling the Knowledge of BERT for CTC-based ASR

H Futami, H Inaguma, M Mimura, S Sakai… - arXiv preprint arXiv …, 2022 - arxiv.org
Connectionist temporal classification (CTC)-based models are attractive because of their
fast inference in automatic speech recognition (ASR). Language model (LM) integration …