LLaVA-Plus: Learning to use tools for creating multimodal agents

S Liu, H Cheng, H Liu, H Zhang, F Li, T Ren… - … on Computer Vision, 2024 - Springer
This paper presents LLaVA-Plus (Large Language and Vision Assistants that Plug and Learn to Use Skills), a general-purpose multimodal assistant trained using an end …

PaLI-X: On scaling up a multilingual vision and language model

X Chen, J Djolonga, P Padlewski, B Mustafa… - arXiv preprint arXiv …, 2023 - arxiv.org
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and
language model, both in terms of size of the components and the breadth of its training task …

Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering

W Lin, J Chen, J Mei, A Coca… - Advances in Neural …, 2023 - proceedings.neurips.cc
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from external knowledge bases to answer visually grounded questions …

AVIS: Autonomous visual information seeking with large language model agent

Z Hu, A Iscen, C Sun, KW Chang… - Advances in …, 2024 - proceedings.neurips.cc
In this paper, we propose an autonomous information seeking visual question answering
framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically …

UniIR: Training and benchmarking universal multimodal information retrievers

C Wei, Y Chen, H Chen, H Hu, G Zhang, J Fu… - … on Computer Vision, 2024 - Springer
Existing information retrieval (IR) models often assume a homogeneous format, limiting their
applicability to diverse user needs, such as searching for images with text descriptions …

Mindstorms in natural language-based societies of mind

M Zhuge, H Liu, F Faccio, DR Ashley… - arXiv preprint arXiv …, 2023 - arxiv.org
Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse
societies of large multimodal neural networks (NNs) that solve problems by interviewing …

Open-domain visual entity recognition: Towards recognizing millions of Wikipedia entities

H Hu, Y Luan, Y Chen, U Khandelwal… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale multi-modal pre-training models such as CLIP and PaLI exhibit strong
generalization on various visual domains and tasks. However, existing image classification …

Instruct-Imagen: Image generation with multi-modal instruction

H Hu, KCK Chan, YC Su, W Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
This paper presents Instruct-Imagen, a model that tackles heterogeneous image
generation tasks and generalizes across unseen tasks. We introduce multi-modal instruction …

On Scaling Up a Multilingual Vision and Language Model

X Chen, J Djolonga, P Padlewski… - Proceedings of the …, 2024 - openaccess.thecvf.com
We explore the boundaries of scaling up a multilingual vision and language model both in
terms of size of the components and the breadth of its training task mixture. Our model …

A comprehensive evaluation of GPT-4V on knowledge-intensive visual question answering

Y Li, L Wang, B Hu, X Chen, W Zhong, C Lyu… - arXiv preprint arXiv …, 2023 - arxiv.org
The emergence of multimodal large models (MLMs) has significantly advanced the field of
visual understanding, offering remarkable capabilities in the realm of visual question …