Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the …
InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image …
Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing …
NVLM: Open frontier-class multimodal LLMs
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary …
MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding …
VisualAgentBench: Towards large multimodal models as visual foundation agents
Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation …
Automatically generating UI code from screenshot: A divide-and-conquer-based approach
Websites are critical in today's digital world, with over 1.11 billion currently active and approximately 252,000 new sites launched daily. Converting website layout design into …
OmChat: A recipe to train multimodal language models with strong long context and video understanding
We introduce OmChat, a model designed to excel in handling long contexts and video understanding tasks. OmChat's new architecture standardizes how different visual inputs are …
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks
Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they demand the comprehension of high-level instructions, complex reasoning, and the …
Baichuan-Omni-1.5 Technical Report
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To …