Large-scale multi-modal pre-trained models: A comprehensive survey
With the urgent demand for generalized deep models, many large pre-trained models have been
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …
VLP: A survey on vision-language pre-training
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …
From image to language: A critical analysis of visual question answering (vqa) approaches, challenges, and opportunities
The multimodal task of Visual Question Answering (VQA), encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …
Unsupervised and pseudo-supervised vision-language alignment in visual dialog
Visual dialog requires models to give reasonable answers according to a series of coherent
questions and related visual concepts in images. However, most current work either focuses …
M-RAT: a Multi-grained Retrieval Augmentation Transformer for Image Captioning
J Song, R Pan, J Zhou, H Yang - Proceedings of the Asian …, 2024 - openaccess.thecvf.com
Current encoder-decoder methods for image captioning mainly consist of an object
detection module (two-stage), or rely on big models with large-scale datasets to improve the …
Robust Contrastive Learning With Dynamic Mixed Margin
Contrastive learning is one of the promising approaches to representation learning. It enforces
that positive pairs become close while negative pairs move far apart. Contrastive learning …
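The snippet above describes the core objective of contrastive learning. A minimal sketch of a classic margin-based contrastive loss illustrating that idea is shown below; the function name, the Euclidean distance metric, and the fixed `margin` parameter are illustrative assumptions here, not the dynamic mixed margin proposed in the cited paper.

```python
import numpy as np

def contrastive_margin_loss(anchor, other, is_positive, margin=1.0):
    """Pairwise contrastive loss: pull positive pairs together,
    push negative pairs apart until they are at least `margin` away."""
    d = np.linalg.norm(anchor - other)      # Euclidean distance between embeddings
    if is_positive:
        return d ** 2                        # positives: penalize any distance
    return max(0.0, margin - d) ** 2         # negatives: penalize only inside the margin
```

For example, a negative pair already farther apart than the margin contributes zero loss, while a negative pair inside the margin is pushed out; the cited work replaces the fixed margin with a dynamic, mixed one.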
Multi-Granularity and Multi-modal Feature Interaction Approach for Text Video Retrieval
W Li, S Wang, D Zhao, S Xu, Z Pan, Z Zhang - arxiv preprint arxiv …, 2024 - arxiv.org
The key of the text-to-video retrieval (TVR) task lies in learning the unique similarity between
each pair of text (consisting of words) and video (consisting of audio and image frames) …