Self-supervised speech representation learning: A review
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …
SpeechCLIP: Integrating speech with pre-trained vision and language model
Data-driven speech processing models usually perform well with a large amount of text
supervision, but collecting transcribed speech data is costly. Therefore, we propose Speech …
What do self-supervised speech models know about words?
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
improving performance and data efficiency on various speech tasks. However, these …
Word segmentation on discovered phone units with dynamic programming and self-supervised scoring
H Kamper - IEEE/ACM Transactions on Audio, Speech, and …, 2022 - ieeexplore.ieee.org
Recent work on unsupervised speech segmentation has used self-supervised models with
phone and word segmentation modules that are trained jointly. This paper instead revisits …
Do multimodal large language models and humans ground language similarly?
Large Language Models (LLMs) have been criticized for failing to connect linguistic
meaning to the world—for failing to solve the “symbol grounding problem.” Multimodal Large …
ConceptBeam: Concept-driven target speech extraction
We propose a novel framework for target speech extraction based on semantic information,
called ConceptBeam. Target speech extraction means extracting the speech of a target …
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
We present DenseAV, a novel dual-encoder grounding architecture that learns high-
resolution, semantically meaningful, and audio-visually aligned features solely through …
Audio-visual neural syntax acquisition
We study phrase structure induction from visually-grounded speech. The core idea is to first
segment the speech waveform into sequences of word segments, and subsequently induce …
SyllableLM: Learning coarse semantic units for speech language models
Language models require tokenized inputs. However, tokenization strategies for continuous
data like audio and vision are often based on simple heuristics such as fixed-size …