Self-supervised audio-visual co-segmentation
Segmenting objects in images and separating sound sources in audio are challenging
tasks, in part because traditional approaches require large amounts of labeled data. In this …
Text-free image-to-speech synthesis using learned segmental units
In this paper we present the first model for directly synthesizing fluent, natural-sounding
spoken audio captions for images that does not require natural language text as an …
Learning hierarchical discrete linguistic units from visually-grounded speech
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …
SpeechCLIP: Integrating speech with pre-trained vision and language model
Data-driven speech processing models usually perform well with a large amount of text
supervision, but collecting transcribed speech data is costly. Therefore, we propose Speech …
Cross-modal discrete representation learning
Recent advances in representation learning have demonstrated an ability to represent
information from different modalities such as video, text, and audio in a single high-level …
Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques
G Chrupała - Journal of Artificial Intelligence Research, 2022 - jair.org
This survey provides an overview of the evolution of visually grounded models of spoken
language over the last 20 years. Such models are inspired by the observation that when …
Visually grounded few-shot word learning in low-resource settings
We propose a visually grounded speech model that learns new words and their visual
depictions from just a few word-image example pairs. Given a set of test images and a …
Improving multimodal speech recognition by data augmentation and speech representations
Multimodal speech recognition aims to improve the performance of automatic speech
recognition (ASR) systems by leveraging additional visual information that is usually …
Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning?--A computational investigation
Decades of research have studied how language-learning infants learn to discriminate
speech sounds, segment words, and associate words with their meanings. While gradual …
Multimodal one-shot learning of speech and images
Imagine a robot is shown new concepts visually together with spoken tags, e.g. "milk",
"eggs", "butter". After seeing one paired audiovisual example per class, it is shown a new set …
eggs"," butter". After seeing one paired audiovisual example per class, it is shown a new set …