Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring

J Hong, M Kim, J Choi, YM Ro - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
This paper deals with Audio-Visual Speech Recognition (AVSR) under a multimodal input
corruption situation where both audio and visual inputs are corrupted, which is not …

VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Q Zhu, L Zhou, Z Zhang, S Liu, B Jiao… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Although speech is a simple and effective way for humans to communicate with the outside
world, a more realistic speech interaction contains multimodal information, e.g., vision, text …

Lip reading for low-resource languages by learning and combining general speech knowledge and language-specific knowledge

M Kim, JH Yeo, J Choi, YM Ro - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
This paper proposes a novel lip reading framework, especially for low-resource languages,
which has not been well addressed in the previous literature. Since low-resource languages …

Lip-to-speech synthesis with Visual Context Attentional GAN

M Kim, J Hong, YM Ro - Advances in Neural Information …, 2021 - proceedings.neurips.cc
In this paper, we propose a novel lip-to-speech generative adversarial network, Visual
Context Attentional GAN (VCA-GAN), which can jointly model local and global lip …

Analyzing lower half facial gestures for lip reading applications: Survey on vision techniques

SJ Preethi - Computer Vision and Image Understanding, 2023 - Elsevier
Lip reading has gained popularity due to the proliferation of emerging real-world
applications. This article provides a comprehensive review of benchmark datasets available …

Speaker-adaptive lip reading with user-dependent padding

M Kim, H Kim, YM Ro - European Conference on Computer Vision, 2022 - Springer
Lip reading aims to predict speech based on lip movements alone. As it focuses on visual
information to model the speech, its performance is inherently sensitive to personal lip …

Many-to-many spoken language translation via unified speech and text representation learning with unit-to-unit translation

M Kim, J Choi, D Kim, YM Ro - arXiv preprint arXiv:2308.01831, 2023 - arxiv.org
In this paper, we propose a method to learn unified representations of multilingual speech
and text with a single model, especially focusing on the purpose of speech synthesis. We …

Prompt tuning of deep neural networks for speaker-adaptive visual speech recognition

M Kim, HI Kim, YM Ro - IEEE Transactions on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Visual Speech Recognition (VSR) aims to transcribe speech into text based on lip
movements alone. As it focuses on visual information to model the speech, its performance …

Intelligible lip-to-speech synthesis with speech units

J Choi, M Kim, YM Ro - arXiv preprint arXiv:2305.19603, 2023 - arxiv.org
In this paper, we propose a novel Lip-to-Speech synthesis (L2S) framework for synthesizing
intelligible speech from a silent lip movement video. Specifically, to complement the …

AKVSR: Audio knowledge empowered visual speech recognition by compressing audio knowledge of a pretrained model

JH Yeo, M Kim, J Choi, DH Kim… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip
movements. VSR is regarded as a challenging task because of the insufficient information …