SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-
image generation. Building upon SANA-1.0, we introduce three key innovations: (1) Efficient …
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
In this paper, we introduce WorldSense, the first benchmark to assess multimodal video
understanding that simultaneously encompasses visual, audio, and text inputs. In contrast …
Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models
Recently, promising progress has been made by open-source vision-language models
(VLMs) in bringing their capabilities closer to those of proprietary frontier models. However …
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
L Fu, B Yang, Z Kuang, J Song, Y Li, L Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models
(LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the …
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for
image and video understanding. The core design philosophy of VideoLLaMA3 is vision …
ViDDAR: Vision Language Model-Based Task-Detrimental Content Detection for Augmented Reality
In Augmented Reality (AR), virtual content enhances user experience by providing
additional information. However, improperly positioned or designed virtual content can be …