Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Comparative layer-wise analysis of self-supervised speech models

A Pasad, B Shi, K Livescu - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
Many self-supervised speech models, varying in their pre-training objective, input modality,
and pre-training data, have been proposed in the last few years. Despite impressive …

Word discovery in visually grounded, self-supervised speech models

P Peng, D Harwath - arXiv preprint arXiv:2203.15081, 2022 - arxiv.org
We present a method for visually-grounded spoken term discovery. After training either a
HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show …

SpeechCLIP: Integrating speech with pre-trained vision and language model

YJ Shih, HF Wang, HJ Chang, L Berry… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Data-driven speech processing models usually perform well with a large amount of text
supervision, but collecting transcribed speech data is costly. Therefore, we propose Speech …

Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques

G Chrupała - Journal of Artificial Intelligence Research, 2022 - jair.org
This survey provides an overview of the evolution of visually grounded models of spoken
language over the last 20 years. Such models are inspired by the observation that when …

Self-supervised representation learning for speech using visual grounding and masked language modeling

P Peng, D Harwath - arXiv preprint arXiv:2202.03543, 2022 - arxiv.org
In this paper, we describe our submissions to the ZeroSpeech 2021 Challenge and
SUPERB benchmark. Our submissions are based on the recently proposed FaST-VGS …

ConceptBeam: Concept driven target speech extraction

Y Ohishi, M Delcroix, T Ochiai, S Araki… - Proceedings of the 30th …, 2022 - dl.acm.org
We propose a novel framework for target speech extraction based on semantic information,
called ConceptBeam. Target speech extraction means extracting the speech of a target …

What a whole slide image can tell? Subtype-guided masked transformer for pathological image captioning

W Qin, R Xu, P Huang, X Wu, H Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Pathological captioning of Whole Slide Images (WSIs), though is essential in computer-
aided pathological diagnosis, has rarely been studied due to the limitations in datasets and …

Learning English with Peppa Pig

M Nikolaus, A Alishahi, G Chrupała - Transactions of the Association …, 2022 - direct.mit.edu
Recent computational models of the acquisition of spoken language via grounding in
perception exploit associations between spoken and visual modalities and learn to …

M-SpeechCLIP: Leveraging large-scale, pre-trained models for multilingual speech to image retrieval

L Berry, YJ Shih, HF Wang, HJ Chang… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
This work investigates the use of large-scale, English-only pre-trained models (CLIP and
HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval …