LLaVA-Plus: Learning to use tools for creating multimodal agents

S Liu, H Cheng, H Liu, H Zhang, F Li, T Ren… - … on Computer Vision, 2024 - Springer
This paper presents LLaVA-Plus (Large Language and Vision Assistants that Plug and Learn to Use Skills), a general-purpose multimodal assistant trained using an end …

PaLI-X: On scaling up a multilingual vision and language model

X Chen, J Djolonga, P Padlewski, B Mustafa… - arXiv preprint arXiv …, 2023 - arxiv.org
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and
language model, both in terms of size of the components and the breadth of its training task …

Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering

W Lin, J Chen, J Mei, A Coca… - Advances in Neural …, 2023 - proceedings.neurips.cc
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from external knowledge bases to answer visually grounded questions …

AVIS: Autonomous visual information seeking with large language model agent

Z Hu, A Iscen, C Sun, KW Chang… - Advances in …, 2024 - proceedings.neurips.cc
In this paper, we propose an autonomous information seeking visual question answering
framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically …

UniIR: Training and benchmarking universal multimodal information retrievers

C Wei, Y Chen, H Chen, H Hu, G Zhang, J Fu… - … on Computer Vision, 2024 - Springer
Existing information retrieval (IR) models often assume a homogeneous format, limiting their
applicability to diverse user needs, such as searching for images with text descriptions …

Mindstorms in natural language-based societies of mind

M Zhuge, H Liu, F Faccio, DR Ashley… - arXiv preprint arXiv …, 2023 - arxiv.org
Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse
societies of large multimodal neural networks (NNs) that solve problems by interviewing …

Open-domain visual entity recognition: Towards recognizing millions of Wikipedia entities

H Hu, Y Luan, Y Chen, U Khandelwal… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale multi-modal pre-training models such as CLIP and PaLI exhibit strong
generalization on various visual domains and tasks. However, existing image classification …

Instruct-Imagen: Image generation with multi-modal instruction

H Hu, KCK Chan, YC Su, W Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
This paper presents Instruct-Imagen, a model that tackles heterogeneous image
generation tasks and generalizes across unseen tasks. We introduce multi-modal instruction …

On Scaling Up a Multilingual Vision and Language Model

X Chen, J Djolonga, P Padlewski… - Proceedings of the …, 2024 - openaccess.thecvf.com
We explore the boundaries of scaling up a multilingual vision and language model both in
terms of size of the components and the breadth of its training task mixture. Our model …

A comprehensive evaluation of GPT-4V on knowledge-intensive visual question answering

Y Li, L Wang, B Hu, X Chen, W Zhong, C Lyu… - arXiv preprint arXiv …, 2023 - arxiv.org
The emergence of multimodal large models (MLMs) has significantly advanced the field of
visual understanding, offering remarkable capabilities in the realm of visual question …