- Academic Search

TAVT: Towards Transferable Audio-Visual Text Generation

W Lin, T Jin, W Pan, L Li, X Cheng… - Proceedings of the …, 2023 - aclanthology.org

Audio-visual text generation aims to understand multi-modality contents and translate them
into texts. Although various transfer learning techniques of text generation have been …

Cited by 12

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

J Choi, SJ Park, M Kim, YM Ro - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech
Translation (AV2AV) framework where the input and output of the system are multimodal (i.e. …

Cited by 3

Geodesic multi-modal mixup for robust fine-tuning

C Oh, J So, H Byun, YT Lim, M Shin… - Advances in Neural …, 2024 - proceedings.neurips.cc

Pre-trained multi-modal models, such as CLIP, provide transferable embeddings and show
promising results in diverse applications. However, the analysis of learned multi-modal …

Cited by 21

Synctalklip: Highly synchronized lip-readable speaker generation with multi-task learning

X Yang, X Cheng, D Fu, M Fang, J Zuo, S Ji… - Proceedings of the …, 2024 - dl.acm.org

Talking Face Generation (TFG) reconstructs facial motions concerning lips given speech
input, which aims to generate high-quality, synchronized, and lip-readable videos. Previous …

Cited by 4

Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts

D Fu, X Cheng, X Yang, W Hanting, Z Zhao… - Proceedings of the 32nd …, 2024 - dl.acm.org

In the burgeoning field of Audio-Visual Speech Recognition (AVSR), extant research has
predominantly concentrated on the training paradigms tailored for high-quality resources …

Cited by 3

Contrastive token-wise meta-learning for unseen performer visual temporal-aligned translation

L Li, T Jin, X Cheng, Y Wang, W Lin… - Findings of the …, 2023 - aclanthology.org

Visual temporal-aligned translation aims to transform the visual sequence into natural
words, including important applicable tasks such as lipreading and fingerspelling …

Cited by 6

Opensr: Open-modality speech recognition via maintaining multi-modality alignment

X Cheng, T Jin, L Li, W Lin, X Duan, Z Zhao - arXiv preprint arXiv …, 2023 - arxiv.org

Speech Recognition builds a bridge between the multimedia streaming (audio-only, visual-
only or audio-visual) and the corresponding text transcription. However, when training the …

Cited by 16

Ace: A generative cross-modal retrieval framework with coarse-to-fine semantic modeling

M Fang, S Ji, J Zuo, H Huang, Y Xia, J Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org

Generative retrieval, which has demonstrated effectiveness in text-to-text retrieval, utilizes a
sequence-to-sequence model to directly generate candidate identifiers based on natural …

Cited by 4

Scene-robust natural language video localization via learning domain-invariant representations

Z Wang, Y Zhao, H Huang, Y Xia… - Findings of the …, 2023 - aclanthology.org

Natural language video localization (NLVL) task involves the semantic matching of a text
query with a moment from an untrimmed video. Previous methods primarily focus on …

Cited by 6

Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

J Tan, X Cheng, L Xiong, L Zhu, X Li… - … on Multimedia and …, 2024 - ieeexplore.ieee.org

Audio-driven talking head generation is a significant and challenging task applicable to
various fields such as virtual avatars, film production, and online conferences. However, the …

Cited by 2
