Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

SlowFast-LLaVA: A strong training-free baseline for video large language models

M Xu, M Gao, Z Gan, HY Chen, Z Lai, H Gang… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language
model (LLM) that can jointly capture detailed spatial semantics and long-range temporal …
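
The two-pathway design named here (dense spatial detail from a few frames plus long-range temporal coverage from many frames) can be illustrated with a short sketch. This is not the authors' code; the pooling factors, stride, and tensor shapes below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats: torch.Tensor,
                    slow_stride: int = 8,
                    fast_pool: int = 4) -> torch.Tensor:
    """Combine a slow and a fast pathway over per-frame patch features.

    frame_feats: (T, H, W, C) patch features from a frozen vision encoder.
    """
    T, H, W, C = frame_feats.shape
    # Slow pathway: every `slow_stride`-th frame at full spatial
    # resolution preserves detailed spatial semantics.
    slow = frame_feats[::slow_stride].reshape(-1, C)
    # Fast pathway: all frames, aggressively average-pooled in space,
    # keeps long-range temporal context cheap.
    fast = frame_feats.permute(0, 3, 1, 2)            # (T, C, H, W)
    fast = F.avg_pool2d(fast, kernel_size=fast_pool)  # (T, C, H/p, W/p)
    fast = fast.permute(0, 2, 3, 1).reshape(-1, C)
    # The concatenated tokens go to an unchanged, frozen LLM,
    # which is what makes the approach training-free.
    return torch.cat([slow, fast], dim=0)

# Example: 32 frames of 24x24 patch features, 1024-dim.
tokens = slowfast_tokens(torch.randn(32, 24, 24, 1024))
print(tokens.shape)  # (4*576 + 32*36, 1024) = (3456, 1024)
```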

Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution

Z Liu, Y Dong, Z Liu, W Hu, J Lu, Y Rao - arXiv preprint arXiv:2409.12961, 2024 - arxiv.org
Visual data comes in various forms, ranging from small icons of just a few pixels to long
videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual …

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

Grounded-VideoLLM: Sharpening fine-grained temporal grounding in video large language models

H Wang, Z Xu, Y Cheng, S Diao, Y Zhou, Y Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in
coarse-grained video understanding; however, they struggle with fine-grained temporal …

Task preference optimization: Improving multimodal large language models with vision task alignment

Z Yan, Z Li, Y He, C Wang, K Li, X Li, X Zeng… - arXiv preprint arXiv …, 2024 - arxiv.org
Current multimodal large language models (MLLMs) struggle with fine-grained or precise
understanding of visuals, even though they provide comprehensive perception and reasoning in a …

Do language models understand time?

X Ding, L Wang - arXiv preprint arXiv:2412.13845, 2024 - arxiv.org
Large language models (LLMs) have revolutionized video-based computer vision
applications, including action recognition, anomaly detection, and video summarization …

FLAASH: Flow-attention adaptive semantic hierarchical fusion for multi-modal tobacco content analysis

NVS Chappa, PD Dobbs, B Raj, K Luu - arXiv preprint arXiv:2410.19896, 2024 - arxiv.org
The proliferation of tobacco-related content on social media platforms poses significant
challenges for public health monitoring and intervention. This paper introduces a novel multi …

Trans4D: Realistic geometry-aware transition for compositional text-to-4D synthesis

B Zeng, L Yang, S Li, J Liu, Z Zhang, J Tian… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in diffusion models have demonstrated exceptional capabilities in image
and video generation, further improving the effectiveness of 4D synthesis. Existing 4D …

PVC: Progressive visual token compression for unified image and video processing in large vision-language models

C Yang, X Dong, X Zhu, W Su, J Wang, H Tian… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (VLMs) have been extended to understand both images
and videos. Visual token compression is leveraged to reduce the considerable token length …
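
The snippet names visual token compression as the mechanism for fitting long videos into a VLM's context window. Below is a generic sketch of that idea, not PVC's actual algorithm: each frame's patch tokens are pooled down to a fixed per-frame budget, so total sequence length scales with the budget rather than the raw patch count. The budget and pooling scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def compress_tokens(frame_feats: torch.Tensor,
                    tokens_per_frame: int = 16) -> torch.Tensor:
    """frame_feats: (T, N, C) patch tokens per frame, N a square grid.

    Returns a compressed sequence of shape (T * tokens_per_frame, C).
    """
    T, N, C = frame_feats.shape
    src = int(N ** 0.5)                  # source grid side, e.g. 24
    dst = int(tokens_per_frame ** 0.5)   # target grid side, e.g. 4
    # Rearrange to (T, C, src, src) and pool each frame's grid down
    # to a fixed dst x dst token budget.
    x = frame_feats.reshape(T, src, src, C).permute(0, 3, 1, 2)
    x = F.adaptive_avg_pool2d(x, dst)    # (T, C, dst, dst)
    return x.permute(0, 2, 3, 1).reshape(-1, C)

# Example: 64 frames x 576 patches -> 64 x 16 = 1024 tokens total.
out = compress_tokens(torch.randn(64, 576, 1024))
print(out.shape)  # torch.Size([1024, 1024])
```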