VideoTree: Adaptive tree-based video representation for LLM reasoning on long videos

Z Wang, S Yu, E Stengel-Eskin, J Yoon… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-form video understanding has been a challenging task due to the high redundancy in
video data and the abundance of query-irrelevant information. To tackle this challenge, we …

Language repository for long video understanding

K Kahatapitiya, K Ranasinghe, J Park… - arXiv preprint arXiv …, 2024 - arxiv.org
Language has become a prominent modality in computer vision with the rise of LLMs.
Despite supporting long context-lengths, their effectiveness in handling long-term …

TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment

W Li, H Fan, Y Wong… - Advances in Neural …, 2025 - proceedings.neurips.cc
Recent advancements in image understanding have benefited from the extensive use of
web image-text pairs. However, video understanding remains a challenge despite the …

Coarse correspondences elicit 3D spacetime understanding in multimodal language models

B Liu, Y Dong, Y Wang, Y Rao, Y Tang, WC Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal language models (MLLMs) are increasingly being implemented in real-world
environments, necessitating their ability to interpret 3D spaces and comprehend temporal …

VideoQA in the era of LLMs: An empirical study

J Xiao, N Huang, H Qin, D Li, Y Li, F Zhu, Z Tao… - International Journal of …, 2025 - Springer
Abstract Video Large Language Models (Video-LLMs) are flourishing and have advanced
many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) …

AMEGO: Active Memory from long EGOcentric videos

G Goletto, T Nagarajan, G Averta, D Damen - European Conference on …, 2024 - Springer
Egocentric videos provide a unique perspective into individuals' daily experiences, yet their
unstructured nature presents challenges for perception. In this paper, we introduce AMEGO …