MA-LMM: Memory-augmented large multimodal model for long-term video understanding

B He, H Li, YK Jang, M Jia, X Cao… - Proceedings of the …, 2024 - openaccess.thecvf.com
With the success of large language models (LLMs), integrating vision models into LLMs to
build vision-language foundation models has gained much more interest recently. However …

OmniTokenizer: A joint image-video tokenizer for visual generation

J Wang, Y Jiang, Z Yuan, B Peng… - Advances in Neural …, 2025 - proceedings.neurips.cc
Tokenizer, serving as a translator to map the intricate visual data into a compact latent
space, lies at the core of visual generative models. Based on the finding that existing …

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Exploring pre-trained text-to-video diffusion models for referring video object segmentation

Z Zhu, X Feng, D Chen, J Yuan, C Qiao… - European Conference on …, 2024 - Springer
In this paper, we explore the visual representations produced from a pre-trained text-to-
video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent …

TrafficVLM: A controllable visual language model for traffic video captioning

QM Dinh, MK Ho, AQ Dang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Traffic video description and analysis have received much attention recently due to the
growing demand for efficient and reliable urban surveillance systems. Most existing methods …

AID: Adapting image2video diffusion models for instruction-guided video prediction

Z Xing, Q Dai, Z Weng, Z Wu, YG Jiang - arXiv preprint arXiv:2406.06465, 2024 - arxiv.org
Text-guided video prediction (TVP) involves predicting the motion of future frames from the
initial frame according to an instruction, which has wide applications in virtual reality …

OmniTracker: Unifying Visual Object Tracking by Tracking-with-Detection

J Wang, Z Wu, D Chen, C Luo, X Dai… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Visual Object Tracking (VOT) aims to estimate the positions of target objects in a video
sequence, which is an important vision task with various real-world applications. Depending …

Do language models understand time?

X Ding, L Wang - arXiv preprint arXiv:2412.13845, 2024 - arxiv.org
Large language models (LLMs) have revolutionized video-based computer vision
applications, including action recognition, anomaly detection, and video summarization …

EIKA: Explicit & Implicit Knowledge-Augmented Network for entity-aware sports video captioning

Z Xi, G Shi, H Sun, B Zhang, S Li, L Wu - Expert Systems with Applications, 2025 - Elsevier
Sports video captioning in real application scenarios requires both entities and specific
scenes. However, it is difficult to extract this fine-grained information solely from the video …

A simple yet effective knowledge guided method for entity-aware video captioning on a basketball benchmark

Z Xi, G Shi, X Li, J Yan, Z Li, L Wu, Z Liu, L Wang - Neurocomputing, 2025 - Elsevier
Despite the recent emergence of video captioning models, how to generate text
descriptions with specific entity names and fine-grained actions is far from solved …