Academic Search

MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org

In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Cited by 205 · All 2 versions

A Survey of Multimodel Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org

With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

Cited by 156 · All 7 versions

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

G Team, P Georgiev, VI Lei, R Burnell, L Bai… - arXiv preprint arXiv …, 2024 - arxiv.org

In this report, we introduce the Gemini 1.5 family of models, representing the next generation
of highly compute-efficient multimodal models capable of recalling and reasoning over fine …

Cited by 945 · All 4 versions

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer

In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Cited by 351 · All 2 versions

Chameleon: Plug-and-play compositional reasoning with large language models

P Lu, B Peng, H Cheng, M Galley… - Advances in …, 2024 - proceedings.neurips.cc

Large language models (LLMs) have achieved remarkable progress in solving various
natural language processing tasks due to emergent reasoning abilities. However, LLMs …

Cited by 396 · All 10 versions

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

Cited by 533 · All 3 versions

MM1: methods, analysis and insights from multimodal LLM pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - … on Computer Vision, 2024 - Springer

In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …

Cited by 180 · All 2 versions

LLaVA-OneVision: Easy visual task transfer

B Li, Y Zhang, D Guo, R Zhang, F Li, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …

Cited by 250 · All 3 versions

DeepSeek-VL: Towards real-world vision-language understanding

H Lu, W Liu, B Zhang, B Wang, K Dong, B Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world
vision and language understanding applications. Our approach is structured around …

Cited by 196 · All 4 versions

Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org

We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric
approach. While stronger language models can enhance multimodal capabilities, the …

Cited by 172 · All 4 versions
