- Academic Search

X Ding, L Wang - arXiv preprint arXiv:2412.13845, 2024 - arxiv.org

Large language models (LLMs) have revolutionized video-based computer vision
applications, including action recognition, anomaly detection, and video summarization …

Cited by 3

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

Y Wang, X Li, Z Yan, Y He, J Yu, X Zeng… - arXiv preprint arXiv …, 2025 - arxiv.org

This paper aims to improve the performance of video multimodal large language models
(MLLM) via long and rich context (LRC) modeling. As a result, we develop a new version of …

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

B Zhang, K Li, Z Cheng, Z Hu, Y Yuan, G Chen… - arXiv preprint arXiv …, 2025 - arxiv.org

In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for
image and video understanding. The core design philosophy of VideoLLaMA3 is vision …

Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

P Hansen-Estruch, D Yan, CY Chung, O Zohar… - arXiv preprint arXiv …, 2025 - arxiv.org

Visual tokenization via auto-encoding empowers state-of-the-art image and video
generative models by compressing pixels into a latent space. Although scaling Transformer …

Cited by 1

Temporal Preference Optimization for Long-Form Video Understanding

R Li, X Wang, Y Zhang, Z Wang… - arXiv preprint arXiv …, 2025 - arxiv.org

Despite significant advancements in video large multimodal models (video-LMMs),
achieving effective temporal grounding in long-form videos remains a challenge for existing …

Redundancy Principles for MLLMs Benchmarks

Z Zhang, X Zhao, X Fang, C Li, X Liu, X Min… - arXiv preprint arXiv …, 2025 - arxiv.org

With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving
demands of the field, the number of benchmarks produced annually has surged into the …

Apollo: An exploration of video understanding in large multimodal models

Do language models understand time?