Self-supervised speech representation learning: A review
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated building specialized models for individual tasks and application scenarios …
Text-free image-to-speech synthesis using learned segmental units
In this paper we present the first model for directly synthesizing fluent, natural-sounding
spoken audio captions for images that does not require natural language text as an …
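To make the "text-free" idea concrete: the pipeline goes image → discrete learned speech units → waveform, with written language never appearing. The sketch below is only a structural placeholder under that assumption; every module body here is invented for illustration and is far simpler than the paper's actual image-to-unit captioner and unit-to-speech synthesizer.

```python
import torch
import torch.nn as nn

class ImageToUnits(nn.Module):
    """Placeholder image-to-unit captioner: pooled image features in, a short
    sequence of discrete learned speech-unit ids out (no text anywhere)."""
    def __init__(self, num_units: int = 256, dim: int = 512, steps: int = 20):
        super().__init__()
        self.steps = steps
        self.proj = nn.Linear(2048, dim)                  # assumed image-feature size
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, num_units)

    def forward(self, image_feats):                       # (batch, 2048)
        h = torch.tanh(self.proj(image_feats)).unsqueeze(1)
        out, _ = self.rnn(h.expand(-1, self.steps, -1).contiguous())
        return self.head(out).argmax(dim=-1)              # (batch, steps) unit ids

class UnitsToSpeech(nn.Module):
    """Placeholder unit-to-waveform synthesizer; a real system would pair a
    sequence-to-sequence synthesizer with a neural vocoder."""
    def __init__(self, num_units: int = 256, dim: int = 512, samples_per_unit: int = 160):
        super().__init__()
        self.embed = nn.Embedding(num_units, dim)
        self.to_wave = nn.Linear(dim, samples_per_unit)

    def forward(self, unit_ids):                          # (batch, steps)
        return self.to_wave(self.embed(unit_ids)).flatten(1)   # (batch, steps*160)

units = ImageToUnits()(torch.randn(2, 2048))              # caption as speech units
waveform = UnitsToSpeech()(units)                         # synthesized audio samples
```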
Learning hierarchical discrete linguistic units from visually-grounded speech
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …
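For context, the vector quantization layer mentioned here works roughly as follows: each frame-level feature from the speech encoder is snapped to its nearest entry in a learned codebook, yielding discrete unit ids. The sketch below is a generic VQ-VAE-style layer with a straight-through gradient estimator; the sizes, loss weighting, and names are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Generic VQ layer: snap each frame vector to its nearest codebook entry."""
    def __init__(self, num_codes: int = 512, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight (assumed value)

    def forward(self, z):                                  # z: (batch, frames, dim)
        flat = z.reshape(-1, z.size(-1))                   # (batch*frames, dim)
        dist = torch.cdist(flat, self.codebook.weight)     # distances to all codes
        idx = dist.argmin(dim=-1).view(z.shape[:-1])       # discrete unit ids
        z_q = self.codebook(idx)                           # quantized frames
        # Codebook + commitment losses; straight-through estimator for gradients.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss

# Example: quantize 100 frames of 256-d speech-encoder output before the
# audio-visual matching objective.
frames = torch.randn(8, 100, 256)
quantized, unit_ids, vq_loss = VectorQuantizer()(frames)
```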
Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques
G Chrupała - Journal of Artificial Intelligence Research, 2022 - jair.org
This survey provides an overview of the evolution of visually grounded models of spoken
language over the last 20 years. Such models are inspired by the observation that when …
Learning English with Peppa Pig
Recent computational models of the acquisition of spoken language via grounding in
perception exploit associations between spoken and visual modalities and learn to …
Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning?--A computational investigation
Decades of research have studied how language-learning infants learn to discriminate
speech sounds, segment words, and associate words with their meanings. While gradual …
M-SpeechCLIP: Leveraging large-scale, pre-trained models for multilingual speech to image retrieval
This work investigates the use of large-scale, English-only pre-trained models (CLIP and
HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval …
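As a rough illustration of this kind of setup, the sketch below pairs a frozen CLIP image encoder with a frozen HuBERT speech encoder and a small learned projection into CLIP's embedding space, scoring image-speech pairs by cosine similarity. The Hugging Face checkpoint names, the mean pooling, and the linear projection are assumptions for illustration, not the paper's recipe.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, HubertModel, Wav2Vec2FeatureExtractor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
speech_proc = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")

# Hypothetical learned bridge from HuBERT's frame features to CLIP's embedding space.
speech_to_clip = torch.nn.Linear(hubert.config.hidden_size, clip.config.projection_dim)

def embed_image(img: Image.Image) -> torch.Tensor:
    inputs = clip_proc(images=img, return_tensors="pt")
    return F.normalize(clip.get_image_features(**inputs), dim=-1)

def embed_speech(waveform, sampling_rate: int = 16000) -> torch.Tensor:
    inputs = speech_proc(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    frames = hubert(**inputs).last_hidden_state        # (1, T, 768)
    pooled = frames.mean(dim=1)                        # crude utterance embedding
    return F.normalize(speech_to_clip(pooled), dim=-1)

# Retrieval score = cosine similarity between the two embeddings.
with torch.no_grad():
    img_emb = embed_image(Image.new("RGB", (224, 224), "gray"))
    spk_emb = embed_speech(torch.randn(16000).numpy())
    print("similarity:", (img_emb @ spk_emb.T).item())
```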
Trilingual semantic embeddings of visually grounded speech with self-attention mechanisms
Y Ohishi, A Kimura, T Kawanishi, et al. - ICASSP, 2020 - ieeexplore.ieee.org
We propose a trilingual semantic embedding model that associates visual objects in images
with segments of speech signals corresponding to spoken words in an unsupervised …
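The self-attention mechanism referred to here can be pictured as a learned pooling step: the model scores every speech frame, softmax-normalizes the scores, and takes the weighted sum as the utterance embedding that is matched against image features. The sketch below shows a single-head version with illustrative dimensions; it is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionPool(nn.Module):
    """Score each frame, then take the attention-weighted sum as the utterance embedding."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, 1)                    # one scalar score per frame

    def forward(self, frames):                            # frames: (batch, T, dim)
        weights = F.softmax(self.score(frames), dim=1)    # (batch, T, 1), sums to 1 over T
        return (weights * frames).sum(dim=1)              # (batch, dim)

# Speech-image retrieval then reduces to cosine similarity between embeddings.
speech_frames = torch.randn(4, 200, 512)                  # encoder output for 4 captions
utterance_emb = F.normalize(SelfAttentionPool()(speech_frames), dim=-1)
image_emb = F.normalize(torch.randn(4, 512), dim=-1)      # stand-in image embeddings
similarity = utterance_emb @ image_emb.T                  # (4, 4) retrieval score matrix
```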
A spoken language dataset of descriptions for speech-based grounded language learning
Grounded language acquisition is a major area of research combining aspects of natural
language processing, computer vision, and signal processing, compounded by domain …
Word recognition, competition, and activation in a model of visually grounded speech
In this paper, we study how word-like units are represented and activated in a recurrent
neural model of visually grounded speech. The model used in our experiments is trained to …
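A concrete way to picture the kind of analysis described: take a recurrent encoder over the speech, keep its hidden state at every frame, and track how similar those states are to a reference vector for a given word as the utterance unfolds. The GRU encoder and cosine-similarity probe below are illustrative assumptions, not the paper's exact architecture or diagnostic.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentSpeechEncoder(nn.Module):
    """Recurrent encoder whose per-frame hidden states can be probed afterwards."""
    def __init__(self, n_mels: int = 40, dim: int = 512):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mels):                              # mels: (batch, T, n_mels)
        states, _ = self.rnn(mels)                        # hidden state at every frame
        return states                                     # (batch, T, dim)

encoder = RecurrentSpeechEncoder()
states = encoder(torch.randn(1, 300, 40))                 # one utterance, 300 frames

# "Activation" of a word over time: similarity between each frame's hidden state
# and a prototype vector for that word (random here; in practice averaged over
# occurrences of the word in the training captions).
word_prototype = torch.randn(512)
activation = F.cosine_similarity(states[0], word_prototype.unsqueeze(0), dim=-1)
peak_frame = int(activation.argmax())                     # frame where the word is most active
```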