Robust wav2vec 2.0: Analyzing domain shift in self-supervised pre-training

WN Hsu, A Sriram, A Baevski, T Likhomanenko… - arXiv preprint arXiv …, 2021 - arxiv.org
Self-supervised learning of speech representations has been a very active research area,
but most work focuses on a single domain, such as read audio books, for which there exist …

Textually pretrained speech language models

M Hassid, T Remez, TA Nguyen, I Gat… - Advances in …, 2024 - proceedings.neurips.cc
Speech language models (SpeechLMs) process and generate acoustic data only, without
textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using …

Audiobox: Unified audio generation with natural language prompts

A Vyas, B Shi, M Le, A Tjandra, YC Wu, B Guo… - arXiv preprint arXiv …, 2023 - arxiv.org
Audio is an essential part of our lives, but creating it often requires expertise and is time-
consuming. Research communities have made great progress over the past year advancing …

TitaNet: Neural model for speaker representation with 1D depth-wise separable convolutions and global context

NR Koluguri, T Park, B Ginsburg - ICASSP 2022-2022 IEEE …, 2022 - ieeexplore.ieee.org
In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker
representations. We employ 1D depth-wise separable convolutions with Squeeze-and …

Rethinking evaluation in ASR: Are our models robust enough?

T Likhomanenko, Q Xu, V Pratap, P Tomasello… - arXiv preprint arXiv …, 2020 - arxiv.org
Is pushing numbers on a single benchmark valuable in automatic speech recognition?
Research results in acoustic modeling are typically evaluated based on performance on a …

Word informativity influences acoustic duration: Effects of contextual predictability on lexical representation

S Seyfarth - Cognition, 2014 - Elsevier
Language-users reduce words in predictable contexts. Previous research indicates
that reduction may be stored in lexical representation if a word is often reduced. Because …

A Review of Natural-Language-Instructed Robot Execution Systems

R Liu, Y Guo, R Jin, X Zhang - AI, 2024 - mdpi.com
It is natural and efficient to use human natural language (NL) directly to instruct robot task
execution without prior user knowledge of instruction patterns. Currently, NL-instructed …

slimIPL: Language-model-free iterative pseudo-labeling

T Likhomanenko, Q Xu, J Kahn, G Synnaeve… - arXiv preprint arXiv …, 2020 - arxiv.org
Recent results in end-to-end automatic speech recognition have demonstrated the efficacy
of pseudo-labeling for semi-supervised models trained both with Connectionist Temporal …

Performance-efficiency trade-offs in unsupervised pre-training for speech recognition

F Wu, K Kim, J Pan, KJ Han… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic
speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture …

Individual differences in artificial and natural language statistical learning

ES Isbilen, SM McCauley, MH Christiansen - Cognition, 2022 - Elsevier
Statistical learning (SL) is considered a cornerstone of cognition. While decades of research
have unveiled the remarkable breadth of structures that participants can learn from statistical …