LongVILA: Scaling long-context visual language models for long videos

F Xue, Y Chen, D Li, Q Hu, L Zhu, X Li, Y Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context capability is critical for multi-modal foundation models, especially for long
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …

Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution

Z Liu, Y Dong, Z Liu, W Hu, J Lu, Y Rao - arXiv preprint arXiv:2409.12961, 2024 - arxiv.org
Visual data comes in various forms, ranging from small icons of just a few pixels to long
videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual …

LVBench: An extreme long video understanding benchmark

W Wang, Z He, W Hong, Y Cheng, X Zhang, J Qi… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent progress in multimodal large language models has markedly enhanced the
understanding of short videos (typically under one minute), and several evaluation datasets …

Apollo: An exploration of video understanding in large multimodal models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

TimeMarker: A versatile video-LLM for long and short video understanding with superior temporal localization ability

S Chen, X Lan, Y Yuan, Z Jie, L Ma - arXiv preprint arXiv:2411.18211, 2024 - arxiv.org
The rapid development of large language models (LLMs) has significantly advanced multimodal
large language models (MLLMs), particularly in vision-language tasks. However, existing …

xGen-MM-Vid (BLIP-3-Video): You only need 32 tokens to represent a video even in VLMs

MS Ryoo, H Zhou, S Kendre, C Qin, L Xue… - arXiv preprint arXiv …, 2024 - arxiv.org
We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos,
particularly designed to efficiently capture temporal information over multiple frames. BLIP-3 …

InternLM-XComposer2.5-OmniLive: A comprehensive multimodal system for long-term streaming video and audio interactions

P Zhang, X Dong, Y Cao, Y Zang, R Qian, X Wei… - arXiv preprint arXiv …, 2024 - arxiv.org
Creating AI systems that can interact with environments over long periods, similar to human
cognition, has been a longstanding research goal. Recent advancements in multimodal …

AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

K Gong, K Feng, B Li, Y Wang, M Cheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro,
and Reka Core, have expanded their capabilities to include vision and audio modalities …

ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models

J Chen, T Zhang, S Huang, Y Niu, L Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the recent breakthroughs achieved by Large Vision Language Models (LVLMs) in
understanding and responding to complex visual-textual contexts, their inherent …

Do language models understand time?

X Ding, L Wang - arXiv preprint arXiv:2412.13845, 2024 - arxiv.org
Large language models (LLMs) have revolutionized video-based computer vision
applications, including action recognition, anomaly detection, and video summarization …