Foundations & trends in multimodal machine learning: Principles, challenges, and open questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
ViperGPT: Visual inference via Python execution for reasoning
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …
Chameleon: Plug-and-play compositional reasoning with large language models
Large language models (LLMs) have achieved remarkable progress in solving various
natural language processing tasks due to emergent reasoning abilities. However, LLMs …
Visual programming: Compositional visual reasoning without training
We present VISPROG, a neuro-symbolic approach to solving complex and compositional
visual tasks given natural language instructions. VISPROG avoids the need for any task …
Multiscale feature extraction and fusion of image and text in VQA
The Visual Question Answering (VQA) task requires finding information in an image that is
relevant to a question in order to answer that question correctly. It can be …
Selection-inference: Exploiting large language models for interpretable logical reasoning
Large language models (LLMs) have been shown to be capable of impressive few-shot
generalisation to new tasks. However, they still tend to perform poorly on multi-step logical …
Decomposed prompting: A modular approach for solving complex tasks
Few-shot prompting is a surprisingly powerful way to use Large Language Models (LLMs) to
solve various tasks. However, this approach struggles as the task complexity increases or …
On the opportunities and risks of foundation models
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …
The all-seeing project v2: Towards general relation comprehension of the open world
We present the All-Seeing Project V2: a new model and dataset designed for
understanding object relations in images. Specifically, we propose the All-Seeing Model V2 …