Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring
This paper deals with Audio-Visual Speech Recognition (AVSR) under a multimodal input
corruption situation where audio inputs and visual inputs are both corrupted, which is not …
VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
Although speech is a simple and effective way for humans to communicate with the outside
world, a more realistic speech interaction contains multimodal information, e.g., vision, text …
Lip reading for low-resource languages by learning and combining general speech knowledge and language-specific knowledge
This paper proposes a novel lip reading framework, especially for low-resource languages,
which has not been well addressed in the previous literature. Since low-resource languages …
Lip to speech synthesis with visual context attentional GAN
In this paper, we propose a novel lip-to-speech generative adversarial network, Visual
Context Attentional GAN (VCA-GAN), which can jointly model local and global lip …
Analyzing lower half facial gestures for lip reading applications: Survey on vision techniques
SJ Preethi - Computer Vision and Image Understanding, 2023 - Elsevier
Lip reading has gained popularity due to the proliferation of emerging real-world
applications. This article provides a comprehensive review of benchmark datasets available …
Speaker-adaptive lip reading with user-dependent padding
Lip reading aims to predict speech based on lip movements alone. As it focuses on visual
information to model the speech, its performance is inherently sensitive to personal lip …
Many-to-many spoken language translation via unified speech and text representation learning with unit-to-unit translation
In this paper, we propose a method to learn unified representations of multilingual speech
and text with a single model, especially focusing on the purpose of speech synthesis. We …
Prompt tuning of deep neural networks for speaker-adaptive visual speech recognition
Visual Speech Recognition (VSR) aims to infer text from speech based on lip
movements alone. As it focuses on visual information to model the speech, its performance …
Intelligible lip-to-speech synthesis with speech units
In this paper, we propose a novel Lip-to-Speech synthesis (L2S) framework for synthesizing
intelligible speech from a silent lip movement video. Specifically, to complement the …
AKVSR: Audio knowledge empowered visual speech recognition by compressing audio knowledge of a pretrained model
Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip
movements. VSR is regarded as a challenging task because of the insufficient information …