The ethical implications of generative audio models: A systematic literature review

J Barnett - Proceedings of the 2023 AAAI/ACM Conference on AI …, 2023 - dl.acm.org
Generative audio models typically focus their applications on music and speech generation,
with recent models achieving human-like quality in their audio output. This paper conducts a …

Human-computer interaction system: A survey of talking-head generation

R Zhen, W Song, Q He, J Cao, L Shi, J Luo - Electronics, 2023 - mdpi.com
Virtual humans are widely employed in various industries, including personal assistance,
intelligent customer service, and online education, thanks to the rapid development of …

MixSpeech: Cross-modality self-learning with audio-visual stream mixup for visual speech translation and recognition

X Cheng, T Jin, R Huang, L Li, W Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multimedia communication facilitates global interaction among people. However, despite
researchers exploring cross-lingual translation techniques such as machine translation and …

Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation

D Yaman, FI Eyiokur, L Bärmann… - … of the IEEE/CVF …, 2024 - openaccess.thecvf.com
In the task of talking face generation, the objective is to generate a face video with lips
synchronized to the corresponding audio while preserving visual details and identity …

A holistic cascade system, benchmark, and human evaluation protocol for expressive speech-to-speech translation

WC Huang, B Peloquin, J Kao, C Wang… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Expressive speech-to-speech translation (S2ST) aims to transfer prosodic attributes of
source speech to target speech while maintaining translation accuracy. Existing research in …

TransFace: Unit-based audio-visual speech synthesizer for talking head translation

X Cheng, R Huang, L Li, T Jin, Z Wang, A Yin… - arXiv preprint arXiv …, 2023 - arxiv.org
Direct speech-to-speech translation achieves high-quality results through the introduction of
discrete units obtained from self-supervised learning. This approach circumvents delays and …

A Systematic Literature Review: Facial Expression and Lip Movement Synchronization of an Audio Track

MH Alshahrani, MS Maashi - IEEE Access, 2024 - ieeexplore.ieee.org
This systematic literature review (SLR) explores the topic of Facial Expression and Lip
Movement Synchronization of an Audio Track in the context of Automatic Dubbing. This SLR …

Talking face generation with audio-deduced emotional landmarks

S Zhai, M Liu, Y Li, Z Gao, L Zhu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
The goal of talking face generation is to synthesize a sequence of face images of the
specified identity, ensuring the mouth movements are synchronized with the given audio …

A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation

A Min, C Hu, Y Ren, H Zhao - arXiv preprint arXiv:2502.00374, 2025 - arxiv.org
Current research in speech-to-speech translation (S2ST) primarily concentrates on
translation accuracy and speech naturalness, often overlooking key elements like …