Streaming long video understanding with large language models

R Qian, X Dong, P Zhang, Y Zang… - Advances in …, 2025 - proceedings.neurips.cc
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for
video understanding, that capably understands arbitrary-length video with a constant …

A simple LLM framework for long-range video question-answering

C Zhang, T Lu, MM Islam, Z Wang, S Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
We present LLoVi, a language-based framework for long-range video question-answering
(LVQA). Unlike prior long-range video understanding methods, which are often costly and …

VideoAgent: Long-Form Video Understanding with Large Language Model as Agent

X Wang, Y Zhang, O Zohar, S Yeung-Levy - European Conference on …, 2024 - Springer
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …

Towards generalist robot learning from internet video: A survey

R McCarthy, DCH Tan, D Schmidt, F Acero… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling deep learning to massive, diverse internet data has yielded remarkably general
capabilities in visual and natural language understanding and generation. However, data …

AnyMAL: An efficient and scalable any-modality augmented language model

S Moon, A Madotto, Z Lin, T Nagarajan… - Proceedings of the …, 2024 - aclanthology.org
We present Any-Modality Augmented Language Model (AnyMAL), a unified model
that reasons over diverse input modality signals (i.e. text, image, video, audio, IMU motion …

Language repository for long video understanding

K Kahatapitiya, K Ranasinghe, J Park… - arXiv preprint arXiv …, 2024 - arxiv.org
Language has become a prominent modality in computer vision with the rise of LLMs.
Despite supporting long context-lengths, their effectiveness in handling long-term …

Memory consolidation enables long-context video understanding

I Balažević, Y Shi, P Papalampidi, R Chaabouni… - arXiv preprint arXiv …, 2024 - arxiv.org
Most transformer-based video encoders are limited to short temporal contexts due to their
quadratic complexity. While various attempts have been made to extend this context, this …

TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment

W Li, H Fan, Y Wong… - Advances in Neural …, 2025 - proceedings.neurips.cc
Recent advancements in image understanding have benefited from the extensive use of
web image-text pairs. However, video understanding remains a challenge despite the …

VideoLLaMB: Long-context video understanding with recurrent memory bridges

Y Wang, C Xie, Y Liu, Z Zheng - arXiv preprint arXiv:2409.01071, 2024 - arxiv.org
Recent advancements in large-scale video-language models have shown significant
potential for real-time planning and detailed interactions. However, their high computational …

DrVideo: Document retrieval based long video understanding

Z Ma, C Gou, H Shi, B Sun, S Li, H Rezatofighi… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing methods for long video understanding primarily focus on videos only lasting tens of
seconds, with limited exploration of techniques for handling longer videos. The increased …