Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models

M Deitke, C Clark, S Lee, R Tripathi, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Today's most advanced multimodal models remain proprietary. The strongest open-weight
models rely heavily on synthetic data from proprietary VLMs to achieve good performance …

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

LVBench: An extreme long video understanding benchmark

W Wang, Z He, W Hong, Y Cheng, X Zhang, J Qi… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent progress in multimodal large language models has markedly enhanced the
understanding of short videos (typically under one minute), and several evaluation datasets …

Benchmark evaluations, applications, and challenges of large vision language models: A survey

Z Li, X Wu, H Du, H Nghiem, G Shi - arXiv preprint arXiv:2501.02189, 2025 - arxiv.org
Multimodal Vision Language Models (VLMs) have emerged as a transformative technology
at the intersection of computer vision and natural language processing, enabling machines …

HART: Efficient visual generation with hybrid autoregressive transformer

H Tang, Y Wu, S Yang, E Xie, J Chen, J Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual
generation model capable of directly generating 1024x1024 images, rivaling diffusion …

LLaVA-Critic: Learning to evaluate multimodal models

T Xiong, X Wang, D Guo, Q Ye, H Fan, Q Gu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as
a generalist evaluator to assess performance across a wide range of multimodal tasks …

Mora: Enabling generalist video generation via a multi-agent framework

Z Yuan, Y Liu, Y Cao, W Sun, H Jia, R Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-to-video generation has made significant strides, but replicating the capabilities of
advanced systems like OpenAI Sora remains challenging due to their closed-source nature …

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding

Z Wu, X Chen, Z Pan, X Liu, W Liu, D Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …

Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

Z Gao, Z Chen, E Cui, Y Ren, W Wang, J Zhu, H Tian… - Visual Intelligence, 2024 - Springer
Multi-modal large language models (MLLMs) have demonstrated impressive performance in
vision-language tasks across a wide range of domains. However, the large model scale and …