Video Instruction Tuning with Synthetic Data

Y Zhang, J Wu, W Li, B Li, Z Ma, Z Liu, C Li - arXiv preprint arXiv …, 2024 - arxiv.org
The development of video large multimodal models (LMMs) has been hindered by the
difficulty of curating large amounts of high-quality raw data from the web. To address this, we …

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

F Xue, Y Chen, D Li, Q Hu, L Zhu, X Li, Y Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context capability is critical for multi-modal foundation models, especially for long
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …

LongVLM: Efficient Long Video Understanding via Large Language Models

Y Weng, M Han, H He, X Chang, B Zhuang - European Conference on …, 2024 - Springer
Empowered by Large Language Models (LLMs), recent advancements in Video-based
LLMs (VideoLLMs) have driven progress in various video understanding tasks. These …

MLP Can Be a Good Transformer Learner

S Lin, P Lyu, D Liu, T Tang, X Liang… - Proceedings of the …, 2024 - openaccess.thecvf.com
The self-attention mechanism is the key to the Transformer but is often criticized for its
computational demands. Previous token pruning works motivate their methods from the view of …

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

H Hua, Y Tang, C Xu, J Luo - arXiv preprint arXiv:2404.12353, 2024 - arxiv.org
Video summarization aims to create short, accurate, and cohesive summaries of longer
videos. Despite the existence of various video summarization datasets, a notable limitation …

VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models

J Wang, C Wang, K Huang, J Huang, L ** - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in
numerous applications. However, the emphasis on brief summary texts during pre-training …

Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies

Y Gao, L Fischer, A Lintner, S Ebling - arXiv preprint arXiv:2410.08860, 2024 - arxiv.org
Audio descriptions (ADs) function as acoustic commentaries designed to assist blind
persons and persons with visual impairments in accessing digital media content on …

What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations

D Liu, C Whitehouse, X Yu, L Mahon, R Saxena… - arXiv preprint arXiv …, 2025 - arxiv.org
Transforming recorded videos into concise and accurate textual summaries is a growing
challenge in multimodal learning. This paper introduces VISTA, a dataset specifically …

COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation

X Deng, Q Yu, A Athar, C Yang, L Yang, X **… - arXiv preprint arXiv …, 2025 - arxiv.org
This paper introduces the COCONut-PanCap dataset, created to enhance panoptic
segmentation and grounded image captioning. Building upon the COCO dataset with …

Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models for Video Captioning and Summarization

R Luo, A Peng, A Vasudev, R Jain - … of the 2nd International Workshop on …, 2024 - dl.acm.org
Video is an increasingly prominent and information-dense medium, yet it poses substantial
challenges for language models. A typical video consists of a sequence of shorter segments …