How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

A survey of multimodal large language model from a data-centric perspective

T Bai, H Liang, B Wan, Y Xu, X Li, S Li, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …

Is Sora a world simulator? A comprehensive survey on general world models and beyond

Z Zhu, X Wang, W Zhao, C Min, N Deng, M Dou… - arXiv preprint arXiv …, 2024 - arxiv.org
General world models represent a crucial pathway toward achieving Artificial General
Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual …

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

LongVU: Spatiotemporal adaptive compression for long video-language understanding

X Shen, Y Xiong, C Zhao, L Wu, J Chen, C Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have shown promising progress in
understanding and analyzing video content. However, processing long videos remains a …

ChronoMagic-Bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation

S Yuan, J Huang, Y Xu, Y Liu, S Zhang, Y Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to
evaluate the temporal and metamorphic capabilities of the T2V models (e.g., Sora and …

UniMD: Towards unifying moment retrieval and temporal action detection

Y Zeng, Y Zhong, C Feng, L Ma - European Conference on Computer …, 2024 - Springer
Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while
Moment Retrieval (MR) aims to identify the events described by open-ended natural …

Grounded-VideoLLM: Sharpening fine-grained temporal grounding in video large language models

H Wang, Z Xu, Y Cheng, S Diao, Y Zhou, Y Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in
coarse-grained video understanding; however, they struggle with fine-grained temporal …

Apollo: An exploration of video understanding in large multimodal models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …