LLaMA-VID: An image is worth 2 tokens in large language models
In this work, we present a novel method to tackle the token generation challenge in Vision
Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current …
InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …
VideoAgent: Long-Form Video Understanding with Large Language Model as Agent
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …
VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding
We explore how reconciling several foundation models (large language models and vision-
language models) with a novel unified memory mechanism could tackle the challenging …
Video understanding with large language models: A survey
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …
SlowFast-LLaVA: A strong training-free baseline for video large language models
We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language
model (LLM) that can jointly capture detailed spatial semantics and long-range temporal …
CinePile: A long video question answering dataset and benchmark
Current datasets for long-form video understanding often fall short of providing genuine long-
form comprehension challenges, as many tasks derived from these datasets can be …
HourVideo: 1-hour video-language understanding
We present HourVideo, a benchmark dataset for hour-long video-language understanding.
Our dataset consists of a novel task suite comprising summarization, perception (recall …
Vamos: Versatile action models for video understanding
What makes good representations for video understanding, such as anticipating future
activities, or answering video-conditioned questions? While earlier approaches focus on …
PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning
Vision-language pre-training has significantly elevated performance across a wide range of
image-language applications. Yet, the pre-training process for video-related tasks demands …