LLaVA-Plus: Learning to use tools for creating multimodal agents
This paper presents LLaVA-Plus (Large Language and Vision Assistants that
Plug and Learn to Use Skills), a general-purpose multimodal assistant trained using an end …
PaLI-X: On scaling up a multilingual vision and language model
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and
language model, both in terms of size of the components and the breadth of its training task …
Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from external knowledge bases to answer visually-grounded questions …
AVIS: Autonomous visual information seeking with large language model agent
In this paper, we propose an autonomous information seeking visual question answering
framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically …
UniIR: Training and benchmarking universal multimodal information retrievers
Existing information retrieval (IR) models often assume a homogeneous format, limiting their
applicability to diverse user needs, such as searching for images with text descriptions …
Mindstorms in natural language-based societies of mind
Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse
societies of large multimodal neural networks (NNs) that solve problems by interviewing …
Open-domain visual entity recognition: Towards recognizing millions of Wikipedia entities
Large-scale multi-modal pre-training models such as CLIP and PaLI exhibit strong
generalization on various visual domains and tasks. However, existing image classification …
Instruct-Imagen: Image generation with multi-modal instruction
This paper presents Instruct-Imagen, a model that tackles heterogeneous image
generation tasks and generalizes across unseen tasks. We introduce multi-modal instruction …
On Scaling Up a Multilingual Vision and Language Model
We explore the boundaries of scaling up a multilingual vision and language model both in
terms of size of the components and the breadth of its training task mixture. Our model …
A comprehensive evaluation of GPT-4V on knowledge-intensive visual question answering
The emergence of multimodal large models (MLMs) has significantly advanced the field of
visual understanding, offering remarkable capabilities in the realm of visual question …