Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models
M Deitke, C Clark, S Lee, R Tripathi, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Today's most advanced multimodal models remain proprietary. The strongest open-weight
models rely heavily on synthetic data from proprietary VLMs to achieve good performance …
Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …
LVBench: An extreme long video understanding benchmark
Recent progress in multimodal large language models has markedly enhanced the
understanding of short videos (typically under one minute), and several evaluation datasets …
Benchmark evaluations, applications, and challenges of large vision language models: A survey
Multimodal Vision Language Models (VLMs) have emerged as a transformative technology
at the intersection of computer vision and natural language processing, enabling machines …
HART: Efficient visual generation with hybrid autoregressive transformer
We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual
generation model capable of directly generating 1024x1024 images, rivaling diffusion …
LLaVA-Critic: Learning to evaluate multimodal models
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as
a generalist evaluator to assess performance across a wide range of multimodal tasks …
Mora: Enabling generalist video generation via a multi-agent framework
Text-to-video generation has made significant strides, but replicating the capabilities of
advanced systems like OpenAI Sora remains challenging due to their closed-source nature …
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …
DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …
Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance
Multi-modal large language models (MLLMs) have demonstrated impressive performance in
vision-language tasks across a wide range of domains. However, the large model scale and …