Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the …

InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output

P Zhang, X Dong, Y Zang, Y Cao, R Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image …

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

NVLM: Open frontier-class multimodal LLMs

W Dai, N Lee, B Wang, Z Yang, Z Liu, J Barker… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary …

MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning

H Zhang, M Gao, Z Gan, P Dufter, N Wenzel… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding …

VisualAgentBench: Towards large multimodal models as visual foundation agents

X Liu, T Zhang, Y Gu, IL Iong, Y Xu, X Song… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation …

Automatically generating UI code from screenshot: A divide-and-conquer-based approach

Y Wan, C Wang, Y Dong, W Wang, S Li, Y Huo… - arXiv preprint arXiv …, 2024 - arxiv.org
Websites are critical in today's digital world, with over 1.11 billion currently active and approximately 252,000 new sites launched daily. Converting website layout design into …

OmChat: A recipe to train multimodal language models with strong long context and video understanding

T Zhao, Q Zhang, K Lee, P Liu, L Zhang, C Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce OmChat, a model designed to excel in handling long contexts and video understanding tasks. OmChat's new architecture standardizes how different visual inputs are …

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

F Zhang, L Wu, H Bai, G Lin, X Li, X Yu, Y Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they demand the comprehension of high-level instructions, complex reasoning, and the …

Baichuan-Omni-1.5 Technical Report

Y Li, J Liu, T Zhang, S Chen, T Li, Z Li, L Liu… - arXiv preprint arXiv …, 2025 - arxiv.org
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To …