Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

VLP: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

From image to language: A critical analysis of visual question answering (VQA) approaches, challenges, and opportunities

MF Ishmam, MSH Shovon, MF Mridha, N Dey - Information Fusion, 2024 - Elsevier
The multimodal task of Visual Question Answering (VQA) encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …

Unsupervised and pseudo-supervised vision-language alignment in visual dialog

F Chen, D Zhang, X Chen, J Shi, S Xu… - Proceedings of the 30th …, 2022 - dl.acm.org
Visual dialog requires models to give reasonable answers according to a series of coherent
questions and related visual concepts in images. However, most current work either focuses …

M-RAT: a Multi-grained Retrieval Augmentation Transformer for Image Captioning

J Song, R Pan, J Zhou, H Yang - Proceedings of the Asian …, 2024 - openaccess.thecvf.com
Current encoder-decoder methods for image captioning mainly consist of an object
detection module (two-stage), or rely on big models with large-scale datasets to improve the …

Robust Contrastive Learning With Dynamic Mixed Margin

J So, Y Lim, Y Kim, C Oh, K Song - IEEE Access, 2023 - ieeexplore.ieee.org
One of the promising ways for the representation learning is contrastive learning. It enforces
that positive pairs become close while negative pairs become far. Contrastive learning …

Multi-Granularity and Multi-modal Feature Interaction Approach for Text Video Retrieval

W Li, S Wang, D Zhao, S Xu, Z Pan, Z Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
The key of the text-to-video retrieval (TVR) task lies in learning the unique similarity between
each pair of text (consisting of words) and video (consisting of audio and image frames) …