Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

VideoTree: Adaptive tree-based video representation for LLM reasoning on long videos

Z Wang, S Yu, E Stengel-Eskin, J Yoon… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-form video understanding has been a challenging task due to the high redundancy in
video data and the abundance of query-irrelevant information. To tackle this challenge, we …

Vamos: Versatile action models for video understanding

S Wang, Q Zhao, MQ Do, N Agarwal, K Lee… - European Conference on …, 2024 - Springer
What makes good representations for video understanding, such as anticipating future
activities, or answering video-conditioned questions? While earlier approaches focus on …

Language repository for long video understanding

K Kahatapitiya, K Ranasinghe, J Park… - arXiv preprint arXiv …, 2024 - arxiv.org
Language has become a prominent modality in computer vision with the rise of LLMs.
Despite supporting long context-lengths, their effectiveness in handling long-term …

TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment

W Li, H Fan, Y Wong… - Advances in Neural …, 2025 - proceedings.neurips.cc
Recent advancements in image understanding have benefited from the extensive use of
web image-text pairs. However, video understanding remains a challenge despite the …

VideoQA in the era of LLMs: An empirical study

J Xiao, N Huang, H Qin, D Li, Y Li, F Zhu, Z Tao… - International Journal of …, 2025 - Springer
Video Large Language Models (Video-LLMs) are flourishing and have advanced
many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) …

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

R Liao, M Erler, H Wang, G Zhai, G Zhang, Y Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
In the video-language domain, recent works in leveraging zero-shot Large Language Model-
based reasoning for video understanding have become competitive challengers to previous …

DrVideo: Document retrieval based long video understanding

Z Ma, C Gou, H Shi, B Sun, S Li, H Rezatofighi… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing methods for long video understanding primarily focus on videos only lasting tens of
seconds, with limited exploration of techniques for handling longer videos. The increased …

Too many frames, not all useful: Efficient strategies for long-form video QA

J Park, K Ranasinghe, K Kahatapitiya, W Ryoo… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-form videos that span across wide temporal intervals are highly information redundant
and contain multiple distinct events or entities that are often loosely related. Therefore, when …

Episodic memory verbalization using hierarchical representations of life-long robot experience

L Bärmann, C DeChant, J Plewnia… - arXiv preprint arXiv …, 2024 - arxiv.org
Verbalization of robot experience, i.e., summarization of and question answering about a
robot's past, is a crucial ability for improving human-robot interaction. Previous works …