Video understanding with large language models: A survey
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …
SlowFast-LLaVA: A strong training-free baseline for video large language models
We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language
model (LLM) that can jointly capture detailed spatial semantics and long-range temporal …
Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution
Visual data comes in various forms, ranging from small icons of just a few pixels to long
videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual …
Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …
Grounded-VideoLLM: Sharpening fine-grained temporal grounding in video large language models
Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in
coarse-grained video understanding; however, they struggle with fine-grained temporal …
Task preference optimization: Improving multimodal large language models with vision task alignment
Current multimodal large language models (MLLMs) struggle with fine-grained or precise
understanding of visuals, even though they offer comprehensive perception and reasoning in a …
Do language models understand time?
Large language models (LLMs) have revolutionized video-based computer vision
applications, including action recognition, anomaly detection, and video summarization …
FLAASH: Flow-attention adaptive semantic hierarchical fusion for multi-modal tobacco content analysis
The proliferation of tobacco-related content on social media platforms poses significant
challenges for public health monitoring and intervention. This paper introduces a novel multi …
Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis
Recent advances in diffusion models have demonstrated exceptional capabilities in image
and video generation, further improving the effectiveness of 4D synthesis. Existing 4D …
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
Large Vision-Language Models (VLMs) have been extended to understand both images
and videos. Visual token compression is leveraged to reduce the considerable token length …