VideoAgent: Long-Form Video Understanding with Large Language Model as Agent
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …
Video understanding with large language models: A survey
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …
The (r)evolution of multimodal large language models: A survey
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …
SlowFast-LLaVA: A strong training-free baseline for video large language models
We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language
model (LLM) that can jointly capture detailed spatial semantics and long-range temporal …
Knowledge-enhanced dual-stream zero-shot composed image retrieval
We study the zero-shot Composed Image Retrieval (ZS-CIR) task which is to retrieve the
target image given a reference image and a description without training on the triplet …
Cat: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios
This paper focuses on the challenge of answering questions in scenarios that are composed
of rich and complex dynamic audio-visual components. Although existing Multimodal Large …
Sifu: Side-view conditioned implicit function for real-world usable clothed human reconstruction
Creating high-quality 3D models of clothed humans from single images for real-world
applications is crucial. Despite recent advancements accurately reconstructing humans in …
MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …
PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning
Vision-language pre-training has significantly elevated performance across a wide range of
image-language applications. Yet, the pre-training process for video-related tasks demands …
Grounded-VideoLLM: Sharpening fine-grained temporal grounding in video large language models
Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in
coarse-grained video understanding, however, they struggle with fine-grained temporal …