Self-supervised audio-visual co-segmentation

A Rouditchenko, H Zhao, C Gan… - ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019 - ieeexplore.ieee.org
Segmenting objects in images and separating sound sources in audio are challenging
tasks, in part because traditional approaches require large amounts of labeled data. In this …

Text-free image-to-speech synthesis using learned segmental units

WN Hsu, D Harwath, C Song, J Glass - arXiv preprint arXiv:2012.15454, 2020 - arxiv.org
In this paper we present the first model for directly synthesizing fluent, natural-sounding
spoken audio captions for images that does not require natural language text as an …

Learning hierarchical discrete linguistic units from visually-grounded speech

D Harwath, WN Hsu, J Glass - arXiv preprint arXiv:1911.09602, 2019 - arxiv.org
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …

SpeechCLIP: Integrating speech with pre-trained vision and language model

YJ Shih, HF Wang, HJ Chang, L Berry… - 2022 IEEE Spoken Language Technology Workshop (SLT), 2023 - ieeexplore.ieee.org
Data-driven speech processing models usually perform well with a large amount of text
supervision, but collecting transcribed speech data is costly. Therefore, we propose Speech …

Cross-modal discrete representation learning

AH Liu, SY Jin, CIJ Lai, A Rouditchenko, A Oliva… - arXiv preprint arXiv …, 2021 - arxiv.org
Recent advances in representation learning have demonstrated an ability to represent
information from different modalities such as video, text, and audio in a single high-level …

Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques

G Chrupała - Journal of Artificial Intelligence Research, 2022 - jair.org
This survey provides an overview of the evolution of visually grounded models of spoken
language over the last 20 years. Such models are inspired by the observation that when …

Visually grounded few-shot word learning in low-resource settings

L Nortje, D Oneaţă, H Kamper - IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024 - ieeexplore.ieee.org
We propose a visually grounded speech model that learns new words and their visual
depictions from just a few word-image example pairs. Given a set of test images and a …

Improving multimodal speech recognition by data augmentation and speech representations

D Oneață, H Cucu - Proceedings of the IEEE/CVF Conference on …, 2022 - openaccess.thecvf.com
Multimodal speech recognition aims to improve the performance of automatic speech
recognition (ASR) systems by leveraging additional visual information that is usually …

Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning?--A computational investigation

K Khorrami, O Räsänen - arXiv preprint arXiv:2109.14200, 2021 - arxiv.org
Decades of research have studied how language-learning infants learn to discriminate
speech sounds, segment words, and associate words with their meanings. While gradual …

Multimodal one-shot learning of speech and images

R Eloff, HA Engelbrecht… - ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019 - ieeexplore.ieee.org
Imagine a robot is shown new concepts visually together with spoken tags, e.g. "milk",
"eggs", "butter". After seeing one paired audiovisual example per class, it is shown a new set …