VideoAgent: Long-Form Video Understanding with Large Language Model as Agent

X Wang, Y Zhang, O Zohar, S Yeung-Levy - European Conference on …, 2024 - Springer
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

The (r)evolution of multimodal large language models: A survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

SlowFast-LLaVA: A strong training-free baseline for video large language models

M Xu, M Gao, Z Gan, HY Chen, Z Lai, H Gang… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language
model (LLM) that can jointly capture detailed spatial semantics and long-range temporal …

Knowledge-enhanced dual-stream zero-shot composed image retrieval

Y Suo, F Ma, L Zhu, Y Yang - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the
target image given a reference image and a description without training on the triplet …

CAT: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios

Q Ye, Z Yu, R Shao, X Xie, P Torr, X Cao - European Conference on …, 2024 - Springer
This paper focuses on the challenge of answering questions in scenarios that are composed
of rich and complex dynamic audio-visual components. Although existing Multimodal Large …

SIFU: Side-view conditioned implicit function for real-world usable clothed human reconstruction

Z Zhang, Z Yang, Y Yang - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Creating high-quality 3D models of clothed humans from single images for real-world
applications is crucial. Despite recent advancements, accurately reconstructing humans in …

MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning

H Zhang, M Gao, Z Gan, P Dufter, N Wenzel… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …

PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning

L Xu, Y Zhao, D Zhou, Z Lin, SK Ng, J Feng - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language pre-training has significantly elevated performance across a wide range of
image-language applications. Yet, the pre-training process for video-related tasks demands …

Grounded-VideoLLM: Sharpening fine-grained temporal grounding in video large language models

H Wang, Z Xu, Y Cheng, S Diao, Y Zhou, Y Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in
coarse-grained video understanding; however, they struggle with fine-grained temporal …