LLaMA-VID: An image is worth 2 tokens in large language models
In this work, we present a novel method to tackle the token generation challenge in Vision
Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current …
InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …
VideoAgent: Long-Form Video Understanding with Large Language Model as Agent
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …
VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding
We explore how reconciling several foundation models (large language models and vision-
language models) with a novel unified memory mechanism could tackle the challenging …
Video understanding with large language models: A survey
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …
SlowFast-LLaVA: A strong training-free baseline for video large language models
We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language
model (LLM) that can jointly capture detailed spatial semantics and long-range temporal …
CinePile: A long video question answering dataset and benchmark
Current datasets for long-form video understanding often fall short of providing genuine long-
form comprehension challenges, as many tasks derived from these datasets can be …
HourVideo: 1-hour video-language understanding
We present HourVideo, a benchmark dataset for hour-long video-language understanding.
Our dataset consists of a novel task suite comprising summarization, perception (recall …
Vamos: Versatile action models for video understanding
What makes good representations for video understanding, such as anticipating future
activities, or answering video-conditioned questions? While earlier approaches focus on …
PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning
Vision-language pre-training has significantly elevated performance across a wide range of
image-language applications. Yet, the pre-training process for video-related tasks demands …