MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture
Expanding the long-context capabilities of Multi-modal Large Language Models~(MLLMs) is
crucial for video understanding, high-resolution image understanding, and multi-modal …
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Although current Multi-modal Large Language Models (MLLMs) demonstrate promising
results in video understanding, processing extremely long videos remains an ongoing …
A Survey on Multimodal Benchmarks: In the Era of Large AI Models
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Long-context modeling is a critical capability for multimodal large language models
(MLLMs), enabling them to process long-form contents with implicit memorization. Despite …
Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
Large Multimodal Models (LMMs) have made significant strides in visual question-
answering for single images. Recent advancements like long-context LMMs have allowed …
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
This paper aims to improve the performance of video multimodal large language models
(MLLMs) via long and rich context (LRC) modeling. As a result, we develop a new version of …
VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video Cognition
Recent advancements in Large Video-Language Models (LVLMs) have driven the
development of benchmarks designed to assess cognitive abilities in video-based tasks …