- Academic Search

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer

With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

Cited by: 190

[PDF] springer.com

Vlp: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer

In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Cited by: 218

[PDF] arxiv.org

From image to language: A critical analysis of visual question answering (vqa) approaches, challenges, and opportunities

MF Ishmam, MSH Shovon, MF Mridha, N Dey - Information Fusion, 2024 - Elsevier

The multimodal task of Visual Question Answering (VQA) encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …

Cited by: 23

[PDF] acm.org

Unsupervised and pseudo-supervised vision-language alignment in visual dialog

F Chen, D Zhang, X Chen, J Shi, S Xu… - Proceedings of the 30th …, 2022 - dl.acm.org

Visual dialog requires models to give reasonable answers according to a series of coherent
questions and related visual concepts in images. However, most current work either focuses …

Cited by: 16

[PDF] thecvf.com

M-RAT: a Multi-grained Retrieval Augmentation Transformer for Image Captioning

J Song, R Pan, J Zhou, H Yang - Proceedings of the Asian …, 2024 - openaccess.thecvf.com

Current encoder-decoder methods for image captioning mainly consist of an object
detection module (two-stage), or rely on big models with large-scale datasets to improve the …

[PDF] ieee.org

Robust Contrastive Learning With Dynamic Mixed Margin

J So, Y Lim, Y Kim, C Oh, K Song - IEEE Access, 2023 - ieeexplore.ieee.org

One of the promising ways for the representation learning is contrastive learning. It enforces
that positive pairs become close while negative pairs become far. Contrastive learning …

Cited by: 1

[PDF] arxiv.org

Multi-Granularity and Multi-modal Feature Interaction Approach for Text Video Retrieval

W Li, S Wang, D Zhao, S Xu, Z Pan, Z Zhang - arXiv preprint arXiv …, 2024 - arxiv.org

The key of the text-to-video retrieval (TVR) task lies in learning the unique similarity between
each pair of text (consisting of words) and video (consisting of audio and image frames) …

HiVLP: Hierarchical vision-language pre-training for fast image-text retrieval

Large-scale multi-modal pre-trained models: A comprehensive survey