TAVT: Towards Transferable Audio-Visual Text Generation
Audio-visual text generation aims to understand multi-modality contents and translate them
into texts. Although various transfer learning techniques of text generation have been …
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech
Translation (AV2AV) framework where the input and output of the system are multimodal (ie …
Geodesic multi-modal mixup for robust fine-tuning
Pre-trained multi-modal models, such as CLIP, provide transferable embeddings and show
promising results in diverse applications. However, the analysis of learned multi-modal …
SyncTalklip: Highly synchronized lip-readable speaker generation with multi-task learning
Talking Face Generation (TFG) reconstructs facial motions concerning lips given speech
input, which aims to generate high-quality, synchronized, and lip-readable videos. Previous …
Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts
In the burgeoning field of Audio-Visual Speech Recognition (AVSR), extant research has
predominantly concentrated on the training paradigms tailored for high-quality resources …
Contrastive token-wise meta-learning for unseen performer visual temporal-aligned translation
Visual temporal-aligned translation aims to transform the visual sequence into natural
words, including important applicable tasks such as lipreading and fingerspelling …
OpenSR: Open-modality speech recognition via maintaining multi-modality alignment
Speech Recognition builds a bridge between the multimedia streaming (audio-only, visual-
only or audio-visual) and the corresponding text transcription. However, when training the …
ACE: A generative cross-modal retrieval framework with coarse-to-fine semantic modeling
Generative retrieval, which has demonstrated effectiveness in text-to-text retrieval, utilizes a
sequence-to-sequence model to directly generate candidate identifiers based on natural …
Scene-robust natural language video localization via learning domain-invariant representations
Natural language video localization (NLVL) task involves the semantic matching of a text
query with a moment from an untrimmed video. Previous methods primarily focus on …
Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation
Audio-driven talking head generation is a significant and challenging task applicable to
various fields such as virtual avatars, film production, and online conferences. However, the …