Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Prompting large language models with answer heuristics for knowledge-based visual question answering
Knowledge-based visual question answering (VQA) requires external knowledge
beyond the image to answer the question. Early studies retrieve required knowledge from …
An empirical study of GPT-3 for few-shot knowledge-based VQA
Knowledge-based visual question answering (VQA) involves answering questions
that require external knowledge not present in the image. Existing methods first retrieve …
Multimodal foundation models: From specialists to general-purpose assistants
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …
Knowledge graphs meet multi-modal learning: A comprehensive survey
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …
From images to textual prompts: Zero-shot visual question answering with frozen large language models
Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …
REVEAL: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory
In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model
(REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve …
Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from external knowledge bases to answer visually-grounded questions …
KAT: A knowledge augmented transformer for vision-and-language
The primary focus of recent work with large-scale transformers has been on optimizing the
amount of information packed into the model's parameters. In this work, we ask a different …
Language models are general-purpose interfaces
Foundation models have received much attention due to their effectiveness across a broad
range of downstream applications. Though there is a big convergence in terms of …