VideoAgent: Long-Form Video Understanding with Large Language Model as Agent

X Wang, Y Zhang, O Zohar, S Yeung-Levy - European Conference on …, 2024 - Springer
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

The (r)evolution of multimodal large language models: A survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

SlowFast-LLaVA: A strong training-free baseline for video large language models

M Xu, M Gao, Z Gan, HY Chen, Z Lai, H Gang… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language
model (LLM) that can jointly capture detailed spatial semantics and long-range temporal …

Knowledge-enhanced dual-stream zero-shot composed image retrieval

Y Suo, F Ma, L Zhu, Y Yang - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the
target image given a reference image and a description without training on the triplet …

CAT: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios

Q Ye, Z Yu, R Shao, X Xie, P Torr, X Cao - European Conference on …, 2024 - Springer
This paper focuses on the challenge of answering questions in scenarios that are composed
of rich and complex dynamic audio-visual components. Although existing Multimodal Large …

SIFU: Side-view conditioned implicit function for real-world usable clothed human reconstruction

Z Zhang, Z Yang, Y Yang - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Creating high-quality 3D models of clothed humans from single images for real-world
applications is crucial. Despite recent advancements, accurately reconstructing humans in …

MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning

H Zhang, M Gao, Z Gan, P Dufter, N Wenzel… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …

PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning

L Xu, Y Zhao, D Zhou, Z Lin, SK Ng, J Feng - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language pre-training has significantly elevated performance across a wide range of
image-language applications. Yet, the pre-training process for video-related tasks demands …

Grounded-VideoLLM: Sharpening fine-grained temporal grounding in video large language models

H Wang, Z Xu, Y Cheng, S Diao, Y Zhou, Y Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in
coarse-grained video understanding; however, they struggle with fine-grained temporal …