Deep learning for visual speech analysis: A survey

C Sheng, G Kuang, L Bai, C Hou, Y Guo… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Visual speech, referring to the visual domain of speech, has attracted increasing attention
due to its wide applications, such as public security, medical treatment, military defense, and …

Facefilter: Audio-visual speech separation using still images

SW Chung, S Choe, JS Chung, HG Kang - arXiv preprint arXiv …, 2020 - arxiv.org
The objective of this paper is to separate a target speaker's speech from a mixture of two
speakers using a deep audio-visual speech separation network. Unlike previous works that …
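
The common recipe in this line of work is to condition a spectrogram-mask estimator on an identity embedding extracted from the target speaker's face. A minimal PyTorch sketch of that conditioning idea follows; the layer sizes, the gating-style fusion, and the class name are illustrative assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class FaceConditionedMasker(nn.Module):
        """Illustrative audio-visual separator: a face embedding gates a
        spectrogram-mask estimator so it extracts the matching speaker."""
        def __init__(self, n_freq=257, face_dim=512, hidden=256):
            super().__init__()
            self.face_proj = nn.Linear(face_dim, hidden)    # identity -> conditioning vector
            self.audio_rnn = nn.LSTM(n_freq, hidden, batch_first=True)
            self.mask_head = nn.Linear(hidden, n_freq)

        def forward(self, mix_mag, face_emb):
            # mix_mag: (B, T, F) mixture magnitude spectrogram
            # face_emb: (B, face_dim) embedding of a still image of the target
            h, _ = self.audio_rnn(mix_mag)
            cond = self.face_proj(face_emb).unsqueeze(1)    # (B, 1, hidden)
            mask = torch.sigmoid(self.mask_head(h * cond))  # soft ratio mask in [0, 1]
            return mask * mix_mag                           # estimated target magnitude

    est = FaceConditionedMasker()(torch.randn(2, 100, 257).abs(), torch.randn(2, 512))
    print(est.shape)  # torch.Size([2, 100, 257])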

Imaginary voice: Face-styled diffusion model for text-to-speech

J Lee, JS Chung, SW Chung - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
The goal of this work is zero-shot text-to-speech synthesis, with speaking styles and voices
learnt from facial characteristics. Inspired by the natural fact that people can imagine the …
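
Concretely, the interface such a system needs is a face encoder whose output stands in for a conventional speaker embedding. The toy sketch below shows only that interface; the diffusion-based acoustic decoder the paper uses is out of scope here, and all shapes and names are assumptions.

    import torch
    import torch.nn as nn

    class FaceStyleEncoder(nn.Module):
        """Illustrative only: maps a face image to a speaker-style vector that
        a TTS decoder can consume in place of a speaker-ID embedding."""
        def __init__(self, style_dim=256):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.head = nn.Linear(64, style_dim)

        def forward(self, face):                   # face: (B, 3, H, W)
            return self.head(self.backbone(face))  # (B, style_dim)

    # `style` would be injected into the acoustic model wherever a learned
    # speaker-ID embedding would normally go.
    style = FaceStyleEncoder()(torch.randn(2, 3, 112, 112))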

Looking into your speech: Learning cross-modal affinity for audio-visual speech separation

J Lee, SW Chung, S Kim, HG Kang… - Proceedings of the …, 2021 - openaccess.thecvf.com
In this paper, we address the problem of separating individual speech signals from videos
using audio-visual neural processing. Most conventional approaches utilize frame-wise …
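
Instead of concatenating audio and visual features frame by frame, affinity-based fusion lets audio frames attend over visual frames so that correspondence is learned rather than assumed. A compact sketch using standard cross-attention; the dimensions and head count are placeholders, not the paper's configuration.

    import torch
    import torch.nn as nn

    class CrossModalAffinity(nn.Module):
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, audio_feats, visual_feats):
            # audio_feats: (B, Ta, dim); visual_feats: (B, Tv, dim); Ta != Tv is fine
            fused, affinity = self.attn(audio_feats, visual_feats, visual_feats)
            return fused, affinity  # affinity: (B, Ta, Tv) soft correspondence map

    fused, aff = CrossModalAffinity()(torch.randn(2, 200, 256), torch.randn(2, 50, 256))
    print(fused.shape, aff.shape)  # (2, 200, 256) (2, 200, 50)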

Lira: Learning visual speech representations from audio through self-supervision

P Ma, R Mira, S Petridis, BW Schuller… - arXiv preprint arXiv …, 2021 - arxiv.org
The large amount of audiovisual content being shared online today has drawn substantial
attention to the prospect of audiovisual self-supervised learning. Recent works have focused …
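
Lira's pretext task is to regress acoustic features of the co-occurring audio from silent lip video alone (PASE+ features in the paper). The sketch below substitutes a toy 3D-conv encoder and generic frame-aligned targets to show the shape of the objective.

    import torch
    import torch.nn as nn

    class LipToAudioRegressor(nn.Module):
        def __init__(self, feat_dim=80):
            super().__init__()
            self.encoder = nn.Sequential(                   # (B, 1, T, H, W) ->
                nn.Conv3d(1, 32, (5, 7, 7), padding=(2, 3, 3)), nn.ReLU(),
                nn.AdaptiveAvgPool3d((None, 1, 1)),         # keep time, pool space
            )
            self.head = nn.Linear(32, feat_dim)

        def forward(self, lips):                            # lips: (B, 1, T, H, W)
            h = self.encoder(lips).squeeze(-1).squeeze(-1)  # (B, 32, T)
            return self.head(h.transpose(1, 2))             # (B, T, feat_dim)

    pred = LipToAudioRegressor()(torch.randn(2, 1, 25, 64, 64))  # 1 s at 25 fps
    target = torch.randn(2, 25, 80)                 # aligned acoustic features
    loss = nn.functional.l1_loss(pred, target)      # self-supervised signal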

Target speech diarization with multimodal prompts

Y Jiang, R Tao, Z Chen, Y Qian, H Li - arXiv preprint arXiv:2406.07198, 2024 - arxiv.org
Traditional speaker diarization seeks to detect "who spoke when" according to speaker
characteristics. Extending to target speech diarization, we detect "when target event …
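
Mechanically, this amounts to fusing a prompt embedding (derived from text, pre-enrolled speech, or another cue) with per-frame audio features and classifying each frame for target activity. A minimal sketch under those assumptions; the fusion scheme and sizes are illustrative.

    import torch
    import torch.nn as nn

    class PromptedDiarizer(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            self.frame_rnn = nn.GRU(dim, dim, batch_first=True)
            self.cls = nn.Linear(2 * dim, 1)

        def forward(self, frames, prompt):   # frames: (B, T, dim); prompt: (B, dim)
            h, _ = self.frame_rnn(frames)
            p = prompt.unsqueeze(1).expand(-1, h.size(1), -1)
            return torch.sigmoid(self.cls(torch.cat([h, p], dim=-1))).squeeze(-1)

    # (B, T) probability that the prompted target is active in each frame
    act = PromptedDiarizer()(torch.randn(2, 500, 256), torch.randn(2, 256))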

Vocalist: An audio-visual synchronisation model for lips and voices

VS Kadandale, JF Montesinos, G Haro - arXiv preprint arXiv:2204.02090, 2022 - arxiv.org
In this paper, we address the problem of lip-voice synchronisation in videos containing a
human face and voice. Our approach is based on determining if the lip motion and the …
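
The paper's model scores synchronisation with attention-based fusion, but the classic formulation this line of work builds on slides audio against video embeddings and picks the temporal offset with the highest similarity. A sketch of that scoring step, assuming per-frame embeddings from already-trained encoders:

    import torch
    import torch.nn.functional as F

    def sync_offset(video_emb, audio_emb, max_shift=15):
        """Compare video_emb[t] with audio_emb[t + s] for each shift s and
        return the shift with the highest mean cosine similarity.
        video_emb, audio_emb: (T, D) frame-level embeddings."""
        scores = []
        for s in range(-max_shift, max_shift + 1):
            v = video_emb[max(0, -s): video_emb.size(0) - max(0, s)]
            a = audio_emb[max(0, s): audio_emb.size(0) - max(0, -s)]
            scores.append(F.cosine_similarity(v, a, dim=-1).mean())
        scores = torch.stack(scores)
        return int(scores.argmax()) - max_shift, scores  # offset in frames

    off, _ = sync_offset(torch.randn(100, 512), torch.randn(100, 512))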

Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision

SW Chung, HG Kang, JS Chung - arXiv preprint arXiv:2004.14326, 2020 - arxiv.org
The goal of this work is to train discriminative cross-modal embeddings without access to
manually annotated data. Recent advances in self-supervised learning have shown that …
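
The usual self-supervised objective here treats the face and voice from the same clip as a positive pair and all other pairings in the batch as negatives. A sketch of a symmetric InfoNCE loss; the temperature and symmetric form are common choices rather than the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def cross_modal_nce(face_emb, voice_emb, temp=0.07):
        f = F.normalize(face_emb, dim=-1)     # (B, D)
        v = F.normalize(voice_emb, dim=-1)    # (B, D)
        logits = f @ v.t() / temp             # (B, B); diagonal = matched pairs
        labels = torch.arange(f.size(0))
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

    loss = cross_modal_nce(torch.randn(8, 512), torch.randn(8, 512))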

Look who's talking: Active speaker detection in the wild

YJ Kim, HS Heo, S Choe, SW Chung, Y Kwon… - arXiv preprint arXiv …, 2021 - arxiv.org
In this work, we present a novel audio-visual dataset for active speaker detection in the wild.
A speaker is considered active when his or her face is visible and the voice is audible …
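
The task the dataset defines is per-frame: given a face track and the accompanying audio, decide for each frame whether that face is speaking. A toy fused classifier illustrating the setup; the encoders are assumed upstream and nothing here reflects the paper's baseline.

    import torch
    import torch.nn as nn

    class ActiveSpeakerHead(nn.Module):
        def __init__(self, vdim=256, adim=256):
            super().__init__()
            self.fuse = nn.GRU(vdim + adim, 128, batch_first=True,
                               bidirectional=True)
            self.cls = nn.Linear(256, 1)

        def forward(self, v_feats, a_feats):  # both (B, T, dim), frame-aligned
            h, _ = self.fuse(torch.cat([v_feats, a_feats], dim=-1))
            return torch.sigmoid(self.cls(h)).squeeze(-1)  # (B, T) speaking prob.

    p = ActiveSpeakerHead()(torch.randn(2, 75, 256), torch.randn(2, 75, 256))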

Improved lite audio-visual speech enhancement

SY Chuang, HM Wang, Y Tsao - IEEE/ACM Transactions on …, 2022 - ieeexplore.ieee.org
Numerous studies have investigated the effectiveness of audio-visual multimodal learning
for speech enhancement (AVSE) tasks, seeking a solution that uses visual data as auxiliary …
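
The "lite" angle is to compress the visual input to a small per-frame feature so the enhancement network itself stays compact. A sketch of that arrangement; the layer sizes are placeholders and the paper's actual design differs.

    import torch
    import torch.nn as nn

    class LiteAVSE(nn.Module):
        def __init__(self, n_freq=257, vis_dim=16, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_freq + vis_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_freq),
            )

        def forward(self, noisy_mag, vis_feat):  # (B, T, F), (B, T, vis_dim)
            x = torch.cat([noisy_mag, vis_feat], dim=-1)
            return torch.sigmoid(self.net(x)) * noisy_mag  # masked clean estimate

    out = LiteAVSE()(torch.randn(2, 100, 257).abs(), torch.randn(2, 100, 16))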