- Academic Search

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 197

Prompting large language models with answer heuristics for knowledge-based visual question answering

Z Shao, Z Yu, M Wang, J Yu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com

Knowledge-based visual question answering (VQA) requires external knowledge
beyond the image to answer the question. Early studies retrieve required knowledge from …

Cited by 210

An empirical study of GPT-3 for few-shot knowledge-based VQA

Z Yang, Z Gan, J Wang, X Hu, Y Lu, Z Liu… - Proceedings of the AAAI …, 2022 - ojs.aaai.org

Knowledge-based visual question answering (VQA) involves answering questions
that require external knowledge not present in the image. Existing methods first retrieve …

Cited by 441

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com

This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on …

Cited by 212

Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org

Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

Cited by 44

From images to textual prompts: Zero-shot visual question answering with frozen large language models

J Guo, J Li, D Li, AMH Tiong, B Li… - Proceedings of the …, 2023 - openaccess.thecvf.com

Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …

Cited by 139

REVEAL: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory

Z Hu, A Iscen, C Sun, Z Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com

In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model
(REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve …

Cited by 80

Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering

W Lin, J Chen, J Mei, A Coca… - Advances in Neural …, 2023 - proceedings.neurips.cc

Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from external knowledge bases to answer visually-grounded questions …

Cited by 40

KAT: A knowledge augmented transformer for vision-and-language

L Gui, B Wang, Q Huang, A Hauptmann, Y Bisk… - arXiv preprint arXiv …, 2021 - arxiv.org

The primary focus of recent work with large-scale transformers has been on optimizing the
amount of information packed into the model's parameters. In this work, we ask a different …

Cited by 166

Language models are general-purpose interfaces

Y Hao, H Song, L Dong, S Huang, Z Chi… - arXiv preprint arXiv …, 2022 - arxiv.org

Foundation models have received much attention due to their effectiveness across a broad
range of downstream applications. Though there is a big convergence in terms of …

Cited by 108

Multi-modal answer validation for knowledge-based VQA

Vision-language pre-training: Basics, recent advances, and future trends