Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Speechclip: Integrating speech with pre-trained vision and language model

YJ Shih, HF Wang, HJ Chang, L Berry… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Data-driven speech processing models usually perform well with a large amount of text
supervision, but collecting transcribed speech data is costly. Therefore, we propose Speech …
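
The entry above pairs a speech encoder with a frozen pre-trained model such as CLIP. A minimal sketch of the kind of symmetric contrastive (InfoNCE) objective typically used for such alignment; the shapes, names, and the loss itself are illustrative assumptions, not SpeechCLIP's actual cascaded or parallel architecture.

import torch
import torch.nn.functional as F

# Hedged sketch: align a batch of speech embeddings with frozen CLIP
# embeddings of the paired images/captions via a symmetric InfoNCE loss.
# A generic alignment objective, not SpeechCLIP's exact method.
def contrastive_loss(speech_emb, clip_emb, temperature=0.07):
    s = F.normalize(speech_emb, dim=-1)   # (B, D) trainable speech embeddings
    c = F.normalize(clip_emb, dim=-1)     # (B, D) frozen CLIP embeddings
    logits = s @ c.t() / temperature      # (B, B) cosine-similarity matrix
    targets = torch.arange(len(s))        # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))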

What do self-supervised speech models know about words?

A Pasad, CM Chien, S Settle, K Livescu - Transactions of the …, 2024 - direct.mit.edu
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
improving performance and data efficiency on various speech tasks. However, these …

Word segmentation on discovered phone units with dynamic programming and self-supervised scoring

H Kamper - IEEE/ACM Transactions on Audio, Speech, and …, 2022 - ieeexplore.ieee.org
Recent work on unsupervised speech segmentation has used self-supervised models with
phone and word segmentation modules that are trained jointly. This paper instead revisits …
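
The title names the two ingredients: dynamic programming over candidate boundaries and a self-supervised scoring model. A minimal sketch of the dynamic-programming step, with a hypothetical segment_cost standing in for the paper's learned scorer.

# Hedged sketch of DP word segmentation over discovered phone units.
# `segment_cost` is a made-up placeholder; Kamper (2022) scores spans
# with a self-supervised model instead.
def segment_cost(units, start, end):
    return 1.0 / (end - start)  # toy cost favoring longer, word-like spans

def dp_segment(units, max_len=5):
    n = len(units)
    best = [float("inf")] * (n + 1)  # best[i] = min cost of segmenting units[:i]
    back = [0] * (n + 1)             # backpointer to the previous boundary
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            cost = best[start] + segment_cost(units, start, end)
            if cost < best[end]:
                best[end], back[end] = cost, start
    bounds, i = [], n                # recover boundaries from backpointers
    while i > 0:
        bounds.append(i)
        i = back[i]
    return sorted(bounds)

print(dp_segment(["p1", "p2", "p3", "p4", "p5", "p6"]))  # [3, 6]: two 3-unit words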

Do multimodal large language models and humans ground language similarly?

CR Jones, B Bergen, S Trott - Computational Linguistics, 2024 - direct.mit.edu
Large Language Models (LLMs) have been criticized for failing to connect linguistic
meaning to the world—for failing to solve the “symbol grounding problem.” Multimodal Large …

Conceptbeam: Concept driven target speech extraction

Y Ohishi, M Delcroix, T Ochiai, S Araki… - Proceedings of the 30th …, 2022 - dl.acm.org
We propose a novel framework for target speech extraction based on semantic information,
called ConceptBeam. Target speech extraction means extracting the speech of a target …

Separating the" Chirp" from the" Chat": Self-supervised Visual Grounding of Sound and Language

M Hamilton, A Zisserman… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present DenseAV, a novel dual-encoder grounding architecture that learns high-
resolution, semantically meaningful, audio-visually aligned features solely through …

Audio-visual neural syntax acquisition

CIJ Lai, F Shi, P Peng, Y Kim, K Gimpel… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
We study phrase structure induction from visually-grounded speech. The core idea is to first
segment the speech waveform into sequences of word segments, and subsequently induce …
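
A hedged illustration of the snippet's two-stage idea (word segments first, phrase structure second): a greedy stand-in that repeatedly merges the most similar adjacent segment embeddings into a binary tree. The merge rule and mean-pooling are assumptions for illustration, not the paper's induction procedure.

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def induce_tree(words, embs):
    # Greedily merge the most similar adjacent pair until one tree remains.
    nodes = list(words)
    embs = [np.asarray(e, dtype=float) for e in embs]
    while len(nodes) > 1:
        i = max(range(len(nodes) - 1),
                key=lambda j: cosine(embs[j], embs[j + 1]))
        nodes[i:i + 2] = [(nodes[i], nodes[i + 1])]
        embs[i:i + 2] = [(embs[i] + embs[i + 1]) / 2]  # mean-pool merged span
    return nodes[0]

tree = induce_tree(["the", "brown", "dog", "barked"], np.random.randn(4, 16))
print(tree)  # a nested tuple, e.g. ((('the', 'brown'), 'dog'), 'barked')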

Syllablelm: Learning coarse semantic units for speech language models

A Baade, P Peng, D Harwath - arXiv preprint arXiv:2410.04029, 2024 - arxiv.org
Language models require tokenized inputs. However, tokenization strategies for continuous
data like audio and vision are often based on simple heuristics such as fixed sized …
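
For contrast with the learned syllable-like units, a minimal sketch of the "fixed sized" heuristic the snippet refers to: chop a continuous feature sequence into equal-length chunks and mean-pool each chunk into one token. Shapes and names are assumptions.

import numpy as np

def fixed_size_tokens(features, chunk=4):
    # Pool a (T, D) feature matrix into one token per `chunk` frames;
    # SyllableLM learns coarser boundaries instead of this fixed grid.
    n = (len(features) // chunk) * chunk          # drop the ragged tail
    return features[:n].reshape(-1, chunk, features.shape[1]).mean(axis=1)

feats = np.random.randn(100, 768)      # e.g. 100 frames of SSL features
print(fixed_size_tokens(feats).shape)  # (25, 768): one token per 4 frames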