Self-supervised speech representation learning: A review
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …
Comparative layer-wise analysis of self-supervised speech models
Many self-supervised speech models, varying in their pre-training objective, input modality,
and pre-training data, have been proposed in the last few years. Despite impressive …
Word discovery in visually grounded, self-supervised speech models
We present a method for visually-grounded spoken term discovery. After training either a
HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show …
SpeechCLIP: Integrating speech with pre-trained vision and language model
Data-driven speech processing models usually perform well with a large amount of text
supervision, but collecting transcribed speech data is costly. Therefore, we propose Speech …
Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques
G Chrupała - Journal of Artificial Intelligence Research, 2022 - jair.org
This survey provides an overview of the evolution of visually grounded models of spoken
language over the last 20 years. Such models are inspired by the observation that when …
Self-supervised representation learning for speech using visual grounding and masked language modeling
In this paper, we describe our submissions to the ZeroSpeech 2021 Challenge and
SUPERB benchmark. Our submissions are based on the recently proposed FaST-VGS …
ConceptBeam: Concept driven target speech extraction
Y Ohishi, M Delcroix, T Ochiai, S Araki… - Proceedings of the 30th …, 2022 - dl.acm.org
We propose a novel framework for target speech extraction based on semantic information,
called ConceptBeam. Target speech extraction means extracting the speech of a target …
What a whole slide image can tell? Subtype-guided masked transformer for pathological image captioning
Pathological captioning of Whole Slide Images (WSIs), though is essential in computer-
aided pathological diagnosis, has rarely been studied due to the limitations in datasets and …
Learning English with Peppa Pig
Recent computational models of the acquisition of spoken language via grounding in
perception exploit associations between spoken and visual modalities and learn to …
M-SpeechCLIP: Leveraging large-scale, pre-trained models for multilingual speech to image retrieval
This work investigates the use of large-scale, English-only pre-trained models (CLIP and
HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval …