SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer

E Xie, J Chen, Y Zhao, J Yu, L Zhu, Y Lin… - arxiv preprint arxiv …, 2025 - arxiv.org
This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image
generation. Building upon SANA-1.0, we introduce three key innovations: (1) Efficient …

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

J Hong, S Yan, J Cai, X Jiang, Y Hu, W Xie - arxiv preprint arxiv …, 2025 - arxiv.org
In this paper, we introduce WorldSense, the first benchmark to assess the multi-modal video
understanding, that simultaneously encompasses visual, audio, and text inputs. In contrast …

Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models

Z Li, G Chen, S Liu, S Wang, V VS, Y Ji, S Lan… - arxiv preprint arxiv …, 2025 - arxiv.org
Recently, promising progress has been made by open-source vision-language models
(VLMs) in bringing their capabilities closer to those of proprietary frontier models. However …

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

L Fu, B Yang, Z Kuang, J Song, Y Li, L Zhu… - arxiv preprint arxiv …, 2024 - arxiv.org
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models
(LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the …

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

B Zhang, K Li, Z Cheng, Z Hu, Y Yuan, G Chen… - arxiv preprint arxiv …, 2025 - arxiv.org
In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for
image and video understanding. The core design philosophy of VideoLLaMA3 is vision …

ViDDAR: Vision Language Model-Based Task-Detrimental Content Detection for Augmented Reality

Y Xiu, T Scargill, M Gorlatova - arxiv preprint arxiv:2501.12553, 2025 - arxiv.org
In Augmented Reality (AR), virtual content enhances user experience by providing
additional information. However, improperly positioned or designed virtual content can be …