Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding

Y Peng, S Dalmia, I Lane… - … Conference on Machine …, 2022 - proceedings.mlr.press
Conformer has proven to be effective in many speech processing tasks. It combines the
benefits of extracting local dependencies using convolutions and global dependencies …
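For orientation, the idea named in the title can be sketched as a block with two parallel branches, self-attention for global context and an MLP-style branch for local context, whose outputs are merged. The sketch below is illustrative only: the merge-by-concatenation, module sizes, and plain MLP branch are assumptions, not the paper's exact cgMLP/merging design.

```python
import torch
import torch.nn as nn

class TwoBranchBlock(nn.Module):
    """Illustrative parallel attention/MLP block (not the exact Branchformer module)."""
    def __init__(self, d_model=256, n_heads=4, d_hidden=1024):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # global branch
        self.mlp_norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                                              # local / MLP branch
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )
        self.merge = nn.Linear(2 * d_model, d_model)                           # concat + projection

    def forward(self, x):                               # x: (batch, time, d_model)
        h = self.attn_norm(x)
        g, _ = self.attn(h, h, h)                       # global dependencies
        l = self.mlp(self.mlp_norm(x))                  # local / pointwise dependencies
        return x + self.merge(torch.cat([g, l], dim=-1))  # residual around merged branches
```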

Fast conformer with linearly scalable attention for efficient speech recognition

D Rekesh, NR Koluguri, S Kriman… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Conformer-based models have become the dominant end-to-end architecture for speech
processing tasks. With the objective of enhancing the conformer architecture for efficient …
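"Linearly scalable attention" refers to attention whose cost grows linearly rather than quadratically with sequence length. A minimal way to get that behavior is to restrict attention to fixed-size chunks, as in the sketch below; this is only an illustration of the scaling idea, not the paper's limited-context-plus-global-token scheme.

```python
import torch
import torch.nn.functional as F

def chunked_local_attention(q, k, v, chunk=64):
    """Attention restricted to non-overlapping chunks: cost is O(T * chunk) instead of O(T^2).
    Illustrative sketch only; Fast Conformer's limited-context attention differs in detail."""
    B, T, D = q.shape
    pad = (-T) % chunk
    q, k, v = [F.pad(t, (0, 0, 0, pad)) for t in (q, k, v)]   # pad time axis to a multiple of chunk
    n = q.shape[1] // chunk
    q, k, v = [t.reshape(B, n, chunk, D) for t in (q, k, v)]  # (B, n_chunks, chunk, D)
    scores = q @ k.transpose(-1, -2) / D ** 0.5               # (B, n, chunk, chunk)
    out = scores.softmax(dim=-1) @ v
    return out.reshape(B, n * chunk, D)[:, :T]                # drop the padding
```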

PromptTTS 2: Describing and generating voices with text prompt

Y Leng, Z Guo, K Shen, X Tan, Z Ju, Y Liu, Y Liu… - arxiv preprint arxiv …, 2023 - arxiv.org
Speech conveys more information than just text, as the same word can be uttered in various
voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods …

Can ChatGPT detect intent? Evaluating large language models for spoken language understanding

M He, PN Garner - arxiv preprint arxiv:2305.13512, 2023 - arxiv.org
Recently, large pretrained language models have demonstrated strong language
understanding capabilities. This is particularly reflected in their zero-shot and in-context …

CWCL: Cross-modal transfer with continuously weighted contrastive loss

RS Srinivasa, J Cho, C Yang… - Advances in …, 2023 - proceedings.neurips.cc
This paper considers contrastive training for cross-modal 0-shot transfer wherein a pre-
trained model in one modality is used for representation learning in another domain using …
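The "continuously weighted" part can be read as replacing the hard positive/negative split of standard contrastive training with soft, per-pair weights derived from the frozen modality's own similarities. The sketch below illustrates that idea; the specific weight construction (row-normalized, clipped cosine similarity) is an assumption, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def continuously_weighted_contrastive_loss(z_frozen, z_new, temperature=0.07):
    """Soft-weighted contrastive objective (illustrative sketch).
    z_frozen: embeddings from the pre-trained modality (e.g. text),
    z_new:    embeddings from the modality being trained (e.g. speech)."""
    z_frozen = F.normalize(z_frozen, dim=-1)
    z_new = F.normalize(z_new, dim=-1)
    logits = z_new @ z_frozen.t() / temperature           # cross-modal similarities
    weights = (z_frozen @ z_frozen.t()).clamp(min=0)      # intra-modal similarities as soft labels
    weights = weights / weights.sum(dim=-1, keepdim=True)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(weights * log_probs).sum(dim=-1).mean()
```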

SAMU-XLSR: Semantically-aligned multimodal utterance-level cross-lingual speech representation

S Khurana, A Laurent, J Glass - IEEE Journal of Selected …, 2022 - ieeexplore.ieee.org
We propose the SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech
Representation learning framework. Unlike previous works on speech representation …

SLUE phase-2: A benchmark suite of diverse spoken language understanding tasks

S Shon, S Arora, CJ Lin, A Pasad, F Wu… - arxiv preprint arxiv …, 2022 - arxiv.org
Spoken language understanding (SLU) tasks have been studied for many decades in the
speech research community, but have not received as much attention as lower-level tasks …

Exploring the capability of Mamba in speech applications

K Miyazaki, Y Masuyama, M Murata - arxiv preprint arxiv:2406.16808, 2024 - arxiv.org
This paper explores the capability of Mamba, a recently proposed architecture based on
state space models (SSMs), as a competitive alternative to Transformer-based models. In …
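At its core, an SSM layer applies the discrete linear recurrence x_t = A x_{t-1} + B u_t with readout y_t = C x_t over the input sequence. The sketch below shows only that recurrence; Mamba additionally makes the parameters input-dependent ("selective") and computes the recurrence with a hardware-efficient parallel scan, neither of which is reproduced here.

```python
import torch

def ssm_scan(u, A, B, C):
    """Minimal discrete state-space recurrence (illustrative only).
    u: (batch, time, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state)."""
    batch, T, _ = u.shape
    d_state = A.shape[0]
    x = torch.zeros(batch, d_state)
    ys = []
    for t in range(T):
        x = x @ A.t() + u[:, t] @ B.t()   # state update: x_t = A x_{t-1} + B u_t
        ys.append(x @ C.t())              # readout:      y_t = C x_t
    return torch.stack(ys, dim=1)         # (batch, time, d_out)
```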

BERT meets CTC: New formulation of end-to-end speech recognition with pre-trained masked language model

Y Higuchi, B Yan, S Arora, T Ogawa… - arxiv preprint arxiv …, 2022 - arxiv.org
This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that
adapts BERT for connectionist temporal classification (CTC). Our formulation relaxes the …
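The CTC piece that the formulation builds on is the standard alignment-marginalizing loss over frame-level label distributions, shown below with PyTorch's built-in implementation. BERT-CTC's actual contribution, conditioning that distribution on BERT's contextual embeddings of partially masked hypotheses to relax CTC's conditional-independence assumption, is not reproduced in this sketch.

```python
import torch
import torch.nn as nn

# Standard CTC loss over (dummy) encoder outputs; shapes follow nn.CTCLoss conventions.
T, B, V = 50, 2, 30                                   # frames, batch size, vocab size (blank = 0)
log_probs = torch.randn(T, B, V).log_softmax(dim=-1)  # frame-level label distributions
targets = torch.randint(1, V, (B, 12), dtype=torch.long)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```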

Structured pruning of self-supervised pre-trained models for speech recognition and understanding

Y Peng, K Kim, F Wu, P Sridhar… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Self-supervised speech representation learning (SSL) has been shown to be effective in various
downstream tasks, but SSL models are usually large and slow. Model compression …
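For a concrete picture of structured (as opposed to unstructured) pruning, the snippet below removes whole output channels of a linear layer by L2 norm using PyTorch's pruning utilities. This is a generic magnitude-based baseline for illustration; the paper instead learns which structures (e.g. heads, layers) to prune with differentiable masks under a sparsity target.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Zero out 50% of the output channels (rows of the weight matrix) by L2 norm.
layer = nn.Linear(768, 3072)
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)
prune.remove(layer, "weight")                                # make the pruning permanent
print((layer.weight.abs().sum(dim=1) == 0).float().mean())   # fraction of pruned rows
```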