Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding
Conformer has proven to be effective in many speech processing tasks. It combines the
benefits of extracting local dependencies using convolutions and global dependencies …
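The local/global split mentioned above (global context via self-attention, local context via a convolution-augmented MLP branch, combined in parallel rather than stacked as in Conformer) can be sketched roughly as follows. This is a minimal illustration only: the module names, layer sizes, and the simple averaging merge are assumptions, not the paper's exact cgMLP block or merge module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelBranchBlock(nn.Module):
    """Sketch of a parallel attention/MLP block in the spirit of Branchformer:
    one branch models global context with self-attention, the other models
    local context with a depthwise-convolutional MLP. Sizes and the simple
    averaging merge are illustrative assumptions."""

    def __init__(self, dim: int, num_heads: int = 4, kernel_size: int = 31):
        super().__init__()
        self.norm_attn = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp_in = nn.Linear(dim, 2 * dim)
        # Depthwise convolution over time gives this branch a local receptive field.
        self.depthwise = nn.Conv1d(2 * dim, 2 * dim, kernel_size,
                                   padding=kernel_size // 2, groups=2 * dim)
        self.mlp_out = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        g = self.norm_attn(x)
        g, _ = self.attn(g, g, g)                       # global branch

        l = self.norm_mlp(x)
        l = F.gelu(self.mlp_in(l))
        l = self.depthwise(l.transpose(1, 2)).transpose(1, 2)
        l = self.mlp_out(l)                             # local branch

        # Merge the two branches; the paper also studies learned/weighted merging.
        return x + 0.5 * (g + l)
```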
Fast Conformer with linearly scalable attention for efficient speech recognition
D Rekesh, NR Koluguri, S Kriman… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Conformer-based models have become the dominant end-to-end architecture for speech
processing tasks. With the objective of enhancing the conformer architecture for efficient …
PromptTTS 2: Describing and generating voices with text prompt
Speech conveys more information than just text, as the same word can be uttered in various
voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods …
Can ChatGPT detect intent? Evaluating large language models for spoken language understanding
Recently, large pretrained language models have demonstrated strong language
understanding capabilities. This is particularly reflected in their zero-shot and in-context …
CWCL: Cross-modal transfer with continuously weighted contrastive loss
This paper considers contrastive training for cross-modal 0-shot transfer wherein a pre-
trained model in one modality is used for representation learning in another domain using …
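As a rough sketch of what a "continuously weighted" contrastive objective could look like, the usual one-hot InfoNCE targets can be replaced with a continuous weight matrix (e.g., similarities computed by the pretrained source-modality model). Everything below, including the per-anchor normalization of the weights, is an illustrative assumption rather than the loss defined in the paper.

```python
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(z_a: torch.Tensor,
                              z_b: torch.Tensor,
                              weights: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Sketch of a continuously weighted contrastive loss.

    z_a, z_b : (batch, dim) embeddings from the two modalities.
    weights  : (batch, batch) continuous targets in [0, 1], e.g. similarities
               from the pretrained source-modality model, instead of the
               identity matrix used by a standard InfoNCE loss.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                # (batch, batch)

    # Normalize weights per anchor into a target distribution, then take the
    # cross-entropy between that distribution and the softmax over logits.
    targets = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()
```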
SAMU-XLSR: Semantically-aligned multimodal utterance-level cross-lingual speech representation
We propose SAMU-XLSR: the Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. Unlike previous works on speech representation …
SLUE Phase-2: A benchmark suite of diverse spoken language understanding tasks
Spoken language understanding (SLU) tasks have been studied for many decades in the
speech research community, but have not received as much attention as lower-level tasks …
Exploring the capability of mamba in speech applications
This paper explores the capability of Mamba, a recently proposed architecture based on
state space models (SSMs), as a competitive alternative to Transformer-based models. In …
BERT meets CTC: New formulation of end-to-end speech recognition with pre-trained masked language model
This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that
adapts BERT for connectionist temporal classification (CTC). Our formulation relaxes the …
Structured pruning of self-supervised pre-trained models for speech recognition and understanding
Self-supervised speech representation learning (SSL) has shown to be effective in various
downstream tasks, but SSL models are usually large and slow. Model compression …