Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

VideoLLM-MoD: Efficient video-language streaming with mixture-of-depths vision computation

S Wu, J Chen, KQ Lin, Q Wang, Y Gao… - Advances in …, 2025 - proceedings.neurips.cc
A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while
increasing the number of vision tokens generally enhances visual understanding, it also …

Diff-Tracker: Text-to-image diffusion models are unsupervised trackers

Z Zhang, L Xu, D Peng, H Rahmani, J Liu - European Conference on …, 2024 - Springer
Abstract We introduce Diff-Tracker, a novel approach for the challenging unsupervised
visual tracking task leveraging the pre-trained text-to-image diffusion model. Our main idea …

Learning video context as interleaved multimodal sequences

KQ Lin, P Zhang, D Gao, X Xia, J Chen, Z Gao… - … on Computer Vision, 2024 - Springer
Narrative videos, such as movies, pose significant challenges in video understanding due to
their rich contexts (characters, dialogues, storylines) and diverse demands (identify who …

VideoLLaMB: Long-context video understanding with recurrent memory bridges

Y Wang, C **e, Y Liu, Z Zheng - arxiv preprint arxiv:2409.01071, 2024 - arxiv.org
Recent advancements in large-scale video-language models have shown significant
potential for real-time planning and detailed interactions. However, their high computational …

Do language models understand time?

X Ding, L Wang - arXiv preprint arXiv:2412.13845, 2024 - arxiv.org
Large language models (LLMs) have revolutionized video-based computer vision
applications, including action recognition, anomaly detection, and video summarization …

Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model

Y Huang, J Xu, B Pei, Y He, G Chen, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-
language model. Designed for deployment on portable devices such as smartphones and …

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

S Zhang, Q Fang, Z Yang, Y Feng - arXiv preprint arXiv:2501.03895, 2025 - arxiv.org
The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked
considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into …

StreamChat: Chatting with Streaming Video

J Liu, Z Yu, S Lan, S Wang, R Fang, J Kautz… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents StreamChat, a novel approach that enhances the interaction
capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming …

[Book][B] Computer Vision-ECCV 2024: 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XXIV.

A Leonardis, E Ricci, S Roth, O Russakovsky, T Sattler… - 2024 - books.google.com
The multi-volume set of LNCS books with volume numbers 15059 up to 15147 constitutes
the refereed proceedings of the 18th European Conference on Computer Vision, ECCV …