Knowledge editing for large language models: A survey

S Wang, Y Zhu, H Liu, Z Zheng, C Chen, J Li - ACM Computing Surveys, 2024 - dl.acm.org
Large Language Models (LLMs) have recently transformed both the academic and industrial
landscapes due to their remarkable capacity to understand, analyze, and generate texts …

A Survey of Multimodal Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

MM1: methods, analysis and insights from multimodal LLM pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - … on Computer Vision, 2024 - Springer
In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …

Deepseek-vl: towards real-world vision-language understanding

H Lu, W Liu, B Zhang, B Wang, K Dong, B Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-
world vision and language understanding applications. Our approach is structured around …

CogVLM: Visual expert for pretrained language models

W Wang, Q Lv, W Yu, W Hong, J Qi, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce CogVLM, a powerful open-source visual language foundation model. Different
from the popular shallow alignment method which maps image features into the input space …

MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?

R Zhang, D Jiang, Y Zhang, H Lin, Z Guo, P Qiu… - … on Computer Vision, 2024 - Springer
The remarkable progress of Multi-modal Large Language Models (MLLMs) has gained
unparalleled attention. However, their capabilities in visual math problem-solving remain …

LLaVA-UHD: An LMM perceiving any aspect ratio and high-resolution images

Z Guo, R Xu, Y Yao, J Cui, Z Ni, C Ge, TS Chua… - … on Computer Vision, 2024 - Springer
Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding
the visual world. Conventional LMMs process images in fixed sizes and limited resolutions …

LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models

F Li, R Zhang, H Zhang, Y Zhang, B Li, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual instruction tuning has made considerable strides in enhancing the capabilities of
Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single …