Dissociating language and thought in large language models
Large language models (LLMs) have come closest among all models to date to mastering
human language, yet opinions about their linguistic and cognitive capabilities remain split …
Foundations & trends in multimodal machine learning: Principles, challenges, and open questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
ViperGPT: Visual inference via Python execution for reasoning
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …
Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Selection-inference: Exploiting large language models for interpretable logical reasoning
Large language models (LLMs) have been shown to be capable of impressive few-shot
generalisation to new tasks. However, they still tend to perform poorly on multi-step logical …
On the opportunities and risks of foundation models
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …
MDETR: Modulated detection for end-to-end multi-modal understanding
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of
interest from the image. However, this crucial module is typically used as a black box …
VinVL: Revisiting visual representations in vision-language models
This paper presents a detailed study of improving vision features and develops an improved
object detection model for vision language (VL) tasks. Compared to the most widely used …
CPT: Colorful prompt tuning for pre-trained vision-language models
Vision-Language Pre-training (VLP) models have shown promising capabilities in
grounding natural language in image data, facilitating a broad range of cross-modal tasks …
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Large-scale pre-training methods of learning cross-modal representations on image-text
pairs are becoming popular for vision-language tasks. While existing methods simply …