LLaMA-VID: An image is worth 2 tokens in large language models

Y Li, C Wang, J Jia - European Conference on Computer Vision, 2024 - Springer
In this work, we present LLaMA-VID, a novel method to tackle the token generation challenge
in Vision Language Models (VLMs) for video and image understanding. Current …
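
As a rough illustration of the title's claim, the sketch below compresses one frame into just two tokens: a context token obtained by attending over patch features with the user's query, and a content token obtained by pooling the frame itself. The function name, shapes, and mean pooling are assumptions for illustration, not LLaMA-VID's actual implementation.

import torch
import torch.nn.functional as F

def two_token_frame(frame_feats, text_query):
    """frame_feats: (num_patches, dim) visual features for one frame.
    text_query: (dim,) embedding of the user instruction.
    Hypothetical sketch, not LLaMA-VID's code."""
    # Context token: attend over patch features with the text query.
    attn = F.softmax(frame_feats @ text_query / frame_feats.shape[-1] ** 0.5, dim=0)
    context_token = attn @ frame_feats           # (dim,)
    # Content token: summarize the frame itself, here by mean pooling.
    content_token = frame_feats.mean(dim=0)      # (dim,)
    return torch.stack([context_token, content_token])  # (2, dim)

feats = torch.randn(256, 768)   # e.g. 16x16 ViT patches for one frame
query = torch.randn(768)
print(two_token_frame(feats, query).shape)  # torch.Size([2, 768])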

TimeChat: A time-sensitive multimodal large language model for long video understanding

S Ren, L Yao, S Li, X Sun… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …
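
To make the "time-sensitive" idea concrete, here is a minimal sketch of timestamp-aware frame encoding, one of the ingredients such models rely on: each frame's visual tokens are tagged with an embedding of their timestamp so downstream attention can ground answers in time. TimestampTagger and its shapes are hypothetical, not TimeChat's code.

import torch
import torch.nn as nn

class TimestampTagger(nn.Module):
    """Hypothetical sketch: prepend a learned time token to frame features."""
    def __init__(self, dim=768):
        super().__init__()
        # Learned projection from a scalar timestamp (seconds) to token space.
        self.time_proj = nn.Linear(1, dim)

    def forward(self, frame_tokens, seconds):
        """frame_tokens: (num_tokens, dim) features of one sampled frame."""
        t = torch.tensor([[seconds]], dtype=frame_tokens.dtype)
        time_token = self.time_proj(t)               # (1, dim)
        # Prepend the time token so attention can see "when", not just "what".
        return torch.cat([time_token, frame_tokens], dim=0)

tagger = TimestampTagger()
tokens = tagger(torch.randn(32, 768), seconds=12.5)
print(tokens.shape)  # torch.Size([33, 768])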

MA-LMM: Memory-augmented large multimodal model for long-term video understanding

B He, H Li, YK Jang, M Jia, X Cao… - Proceedings of the …, 2024 - openaccess.thecvf.com
With the success of large language models (LLMs), integrating vision models into LLMs to
build vision-language foundation models has recently gained increasing interest. However …
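
A minimal sketch of the memory-augmentation idea, assuming a fixed-capacity bank that absorbs frames online and, when full, merges the most similar adjacent pair of features so memory stays bounded; the function and capacity below are illustrative, not MA-LMM's exact compression scheme.

import torch
import torch.nn.functional as F

def append_with_compression(memory, new_feat, capacity):
    """memory: (n, dim) past frame features; new_feat: (dim,)."""
    memory = torch.cat([memory, new_feat.unsqueeze(0)], dim=0)
    if memory.shape[0] > capacity:
        # Cosine similarity between each adjacent pair of memory slots.
        sims = F.cosine_similarity(memory[:-1], memory[1:], dim=-1)
        i = int(sims.argmax())
        # Average the most redundant adjacent pair into a single slot.
        merged = (memory[i] + memory[i + 1]) / 2
        memory = torch.cat([memory[:i], merged.unsqueeze(0), memory[i + 2:]], dim=0)
    return memory

mem = torch.empty(0, 768)
for _ in range(100):                       # stream 100 frames
    mem = append_with_compression(mem, torch.randn(768), capacity=20)
print(mem.shape)  # torch.Size([20, 768]) no matter the video length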

Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis

C Fu, Y Dai, Y Luo, L Li, S Ren, R Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs)
have emerged as a focal point in recent advancements. However, the predominant focus …

VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in Video-LLMs

Z Cheng, S Leng, H Zhang, Y Xin, X Li, G Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs)
designed to enhance spatial-temporal modeling and audio understanding in video …
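
One common way to realize spatial-temporal modeling is a convolutional connector that downsamples the (time, height, width) feature grid before projecting into the LLM's embedding space. The sketch below illustrates that pattern; the kernel size, dimensions, and class name are assumptions, not VideoLLaMA 2's exact connector.

import torch
import torch.nn as nn

class STConnector(nn.Module):
    """Hypothetical sketch of a spatial-temporal downsampling connector."""
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        # One strided 3D conv halves the temporal and spatial resolution.
        self.conv = nn.Conv3d(vis_dim, vis_dim, kernel_size=3, stride=2, padding=1)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, x):
        """x: (batch, time, height, width, vis_dim) patch features."""
        x = x.permute(0, 4, 1, 2, 3)                 # (B, C, T, H, W) for Conv3d
        x = self.conv(x)
        x = x.permute(0, 2, 3, 4, 1).flatten(1, 3)   # (B, T'*H'*W', C)
        return self.proj(x)                          # tokens for the LLM

connector = STConnector()
tokens = connector(torch.randn(1, 8, 16, 16, 1024))
print(tokens.shape)  # torch.Size([1, 256, 4096]): 8x16x16 -> 4x8x8 tokens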

Large models for time series and spatio-temporal data: A survey and outlook

M Jin, Q Wen, Y Liang, C Zhang, S Xue, X Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Temporal data, notably time series and spatio-temporal data, are prevalent in real-world
applications. They capture dynamic system measurements and are produced in vast …

Streaming dense video captioning

X Zhou, A Arnab, S Buch, S Yan… - Proceedings of the …, 2024 - openaccess.thecvf.com
An ideal model for dense video captioning--predicting captions localized temporally in a
video--should be able to handle long input videos, predict rich, detailed textual descriptions …
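
A hedged sketch of one way to keep memory fixed while streaming: incoming frame tokens are folded into K running centroids, so the summary stays the same size however long the video runs. update_memory and its parameters are illustrative of the fixed-size-memory idea, not the paper's exact module.

import torch

def update_memory(centroids, counts, new_tokens):
    """centroids: (K, dim); counts: (K,); new_tokens: (n, dim)."""
    # Assign each new token to its nearest centroid.
    dists = torch.cdist(new_tokens, centroids)        # (n, K)
    assign = dists.argmin(dim=1)                      # (n,)
    for k in assign.unique():
        members = new_tokens[assign == k]
        # Running mean keeps each centroid a summary of all tokens it absorbed.
        total = counts[k] + members.shape[0]
        centroids[k] = (centroids[k] * counts[k] + members.sum(dim=0)) / total
        counts[k] = total
    return centroids, counts

K, dim = 16, 768
memory, counts = torch.randn(K, dim), torch.zeros(K)
for _ in range(50):                                   # 50 incoming frames
    memory, counts = update_memory(memory, counts, torch.randn(32, dim))
print(memory.shape)  # torch.Size([16, 768]) regardless of video length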

InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output

P Zhang, X Dong, Y Zang, Y Cao, R Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …

PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning

L Xu, Y Zhao, D Zhou, Z Lin, SK Ng, J Feng - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language pre-training has significantly elevated performance across a wide range of
image-language applications. Yet, the pre-training process for video-related tasks demands …
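
The "parameter-free" extension can be sketched as plain pooling: per-frame features from an image VLM are adaptively average-pooled over the spatial-temporal grid down to a fixed token budget, with no new trainable weights. The target sizes below are assumptions, not PLLaVA's actual configuration.

import torch
import torch.nn.functional as F

def pool_video_tokens(frame_feats, out_t=16, out_h=12, out_w=12):
    """frame_feats: (T, H, W, dim) patch features from an image VLM."""
    x = frame_feats.permute(3, 0, 1, 2).unsqueeze(0)     # (1, dim, T, H, W)
    # Parameter-free: adaptive average pooling, no weights to train.
    x = F.adaptive_avg_pool3d(x, (out_t, out_h, out_w))
    return x.squeeze(0).permute(1, 2, 3, 0).flatten(0, 2)  # (T'*H'*W', dim)

feats = torch.randn(32, 24, 24, 1024)   # 32 frames of 24x24 patches
tokens = pool_video_tokens(feats)
print(tokens.shape)  # torch.Size([2304, 1024]): 16*12*12 tokens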

Streaming long video understanding with large language models

R Qian, X Dong, P Zhang, Y Zang… - Advances in …, 2025 - proceedings.neurips.cc
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for
video understanding that capably understands arbitrary-length video with a constant …
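
A minimal sketch of a constant token budget for arbitrary-length video, assuming each clip is absorbed into a fixed-size summary by cross-attention conditioned on the summary so far; StreamingSummarizer, its dimensions, and the attention layout are illustrative, not VideoStreaming's architecture.

import torch
import torch.nn as nn

class StreamingSummarizer(nn.Module):
    """Hypothetical sketch: a fixed set of summary tokens absorbs each clip."""
    def __init__(self, dim=768, num_summary=64):
        super().__init__()
        self.summary = nn.Parameter(torch.randn(num_summary, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, clips):
        state = self.summary.unsqueeze(0)              # (1, S, dim)
        for clip in clips:                             # clip: (n_tokens, dim)
            # Attend from the running summary into [summary; clip] tokens.
            kv = torch.cat([state, clip.unsqueeze(0)], dim=1)
            state, _ = self.attn(state, kv, kv)        # stays (1, S, dim)
        return state.squeeze(0)                        # always 64 tokens

model = StreamingSummarizer()
clips = [torch.randn(256, 768) for _ in range(10)]     # ten 256-token clips
print(model(clips).shape)  # torch.Size([64, 768]) for any video length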