Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the …

InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output

P Zhang, X Dong, Y Zang, Y Cao, R Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image …

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

NVLM: Open frontier-class multimodal LLMs

W Dai, N Lee, B Wang, Z Yang, Z Liu, J Barker… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary …

MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning

H Zhang, M Gao, Z Gan, P Dufter, N Wenzel… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding …

VisualAgentBench: Towards large multimodal models as visual foundation agents

X Liu, T Zhang, Y Gu, IL Iong, Y Xu, X Song… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation …

Automatically generating UI code from screenshot: A divide-and-conquer-based approach

Y Wan, C Wang, Y Dong, W Wang, S Li, Y Huo… - arXiv preprint arXiv …, 2024 - arxiv.org
Websites are critical in today's digital world, with over 1.11 billion currently active and approximately 252,000 new sites launched daily. Converting website layout design into …

OmChat: A recipe to train multimodal language models with strong long context and video understanding

T Zhao, Q Zhang, K Lee, P Liu, L Zhang, C Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce OmChat, a model designed to excel in handling long contexts and video understanding tasks. OmChat's new architecture standardizes how different visual inputs are …

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

F Zhang, L Wu, H Bai, G Lin, X Li, X Yu, Y Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they demand the comprehension of high-level instructions, complex reasoning, and the …

Baichuan-Omni-1.5 Technical Report

Y Li, J Liu, T Zhang, S Chen, T Li, Z Li, L Liu… - arXiv preprint arXiv …, 2025 - arxiv.org
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To …