Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Text-free image-to-speech synthesis using learned segmental units

WN Hsu, D Harwath, C Song, J Glass - arXiv preprint arXiv:2012.15454, 2020 - arxiv.org
In this paper we present the first model for directly synthesizing fluent, natural-sounding
spoken audio captions for images that does not require natural language text as an …

Learning hierarchical discrete linguistic units from visually-grounded speech

D Harwath, WN Hsu, J Glass - arXiv preprint arXiv:1911.09602, 2019 - arxiv.org
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …

Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques

G Chrupała - Journal of Artificial Intelligence Research, 2022 - jair.org
This survey provides an overview of the evolution of visually grounded models of spoken
language over the last 20 years. Such models are inspired by the observation that when …

Learning English with Peppa Pig

M Nikolaus, A Alishahi, G Chrupała - Transactions of the Association …, 2022 - direct.mit.edu
Recent computational models of the acquisition of spoken language via grounding in
perception exploit associations between spoken and visual modalities and learn to …

Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation

K Khorrami, O Räsänen - arXiv preprint arXiv:2109.14200, 2021 - arxiv.org
Decades of research have studied how language-learning infants learn to discriminate
speech sounds, segment words, and associate words with their meanings. While gradual …

M-SpeechCLIP: Leveraging large-scale, pre-trained models for multilingual speech to image retrieval

L Berry, YJ Shih, HF Wang, HJ Chang… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
This work investigates the use of large-scale, English-only pre-trained models (CLIP and
HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval …

Trilingual semantic embeddings of visually grounded speech with self-attention mechanisms

Y Ohishi, A Kimura, T Kawanishi… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
We propose a trilingual semantic embedding model that associates visual objects in images
with segments of speech signals corresponding to spoken words in an unsupervised …

A spoken language dataset of descriptions for speech-based grounded language learning

GY Kebe, P Higgins, P Jenkins, K Darvish… - Advances in neural …, 2021 - par.nsf.gov
Grounded language acquisition is a major area of research combining aspects of natural
language processing, computer vision, and signal processing, compounded by domain …

Word recognition, competition, and activation in a model of visually grounded speech

WN Havard, JP Chevrot, L Besacier - arXiv preprint arXiv:1909.08491, 2019 - arxiv.org
In this paper, we study how word-like units are represented and activated in a recurrent
neural model of visually grounded speech. The model used in our experiments is trained to …