TAVT: Towards Transferable Audio-Visual Text Generation

W Lin, T Jin, W Pan, L Li, X Cheng… - Proceedings of the …, 2023 - aclanthology.org
Audio-visual text generation aims to understand multi-modality contents and translate them
into texts. Although various transfer learning techniques of text generation have been …

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

J Choi, SJ Park, M Kim, YM Ro - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech
Translation (AV2AV) framework where the input and output of the system are multimodal (i.e. …

Geodesic multi-modal mixup for robust fine-tuning

C Oh, J So, H Byun, YT Lim, M Shin… - Advances in Neural …, 2024 - proceedings.neurips.cc
Pre-trained multi-modal models, such as CLIP, provide transferable embeddings and show
promising results in diverse applications. However, the analysis of learned multi-modal …

SyncTalklip: Highly synchronized lip-readable speaker generation with multi-task learning

X Yang, X Cheng, D Fu, M Fang, J Zuo, S Ji… - Proceedings of the …, 2024 - dl.acm.org
Talking Face Generation (TFG) reconstructs facial motions concerning lips given speech
input, which aims to generate high-quality, synchronized, and lip-readable videos. Previous …

Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts

D Fu, X Cheng, X Yang, W Hanting, Z Zhao… - Proceedings of the 32nd …, 2024 - dl.acm.org
In the burgeoning field of Audio-Visual Speech Recognition (AVSR), extant research has
predominantly concentrated on the training paradigms tailored for high-quality resources …

Contrastive token-wise meta-learning for unseen performer visual temporal-aligned translation

L Li, T Jin, X Cheng, Y Wang, W Lin… - Findings of the …, 2023 - aclanthology.org
Visual temporal-aligned translation aims to transform the visual sequence into natural
words, including important applicable tasks such as lipreading and fingerspelling …

OpenSR: Open-modality speech recognition via maintaining multi-modality alignment

X Cheng, T Jin, L Li, W Lin, X Duan, Z Zhao - arXiv preprint arXiv …, 2023 - arxiv.org
Speech Recognition builds a bridge between the multimedia streaming (audio-only, visual-
only or audio-visual) and the corresponding text transcription. However, when training the …

ACE: A generative cross-modal retrieval framework with coarse-to-fine semantic modeling

M Fang, S Ji, J Zuo, H Huang, Y Xia, J Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative retrieval, which has demonstrated effectiveness in text-to-text retrieval, utilizes a
sequence-to-sequence model to directly generate candidate identifiers based on natural …

Scene-robust natural language video localization via learning domain-invariant representations

Z Wang, Y Zhao, H Huang, Y Xia… - Findings of the …, 2023 - aclanthology.org
Natural language video localization (NLVL) task involves the semantic matching of a text
query with a moment from an untrimmed video. Previous methods primarily focus on …

Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

J Tan, X Cheng, L Xiong, L Zhu, X Li… - … on Multimedia and …, 2024 - ieeexplore.ieee.org
Audio-driven talking head generation is a significant and challenging task applicable to
various fields such as virtual avatars, film production, and online conferences. However, the …