How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …
Video understanding with large language models: A survey
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …
A survey of multimodal large language models from a data-centric perspective
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …
Is Sora a world simulator? A comprehensive survey on general world models and beyond
General world models represent a crucial pathway toward achieving Artificial General
Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual …
Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …
LongVU: Spatiotemporal adaptive compression for long video-language understanding
Multimodal Large Language Models (MLLMs) have shown promising progress in
understanding and analyzing video content. However, processing long videos remains a …
ChronoMagic-Bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation
We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to
evaluate the temporal and metamorphic capabilities of the T2V models (e.g., Sora and …
UniMD: Towards unifying moment retrieval and temporal action detection
Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while
Moment Retrieval (MR) aims to identify the events described by open-ended natural …
Grounded-VideoLLM: Sharpening fine-grained temporal grounding in video large language models
Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in
coarse-grained video understanding; however, they struggle with fine-grained temporal …
Apollo: An exploration of video understanding in large multimodal models
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …