Lvbench: An extreme long video understanding benchmark

W Wang, Z He, W Hong, Y Cheng, X Zhang, J Qi… - arxiv preprint arxiv …, 2024 - arxiv.org

Number it: Temporal grounding videos like flipping manga

Y Wu, X Hu, Y Sun, Y Zhou, W Zhu, F Rao… - arxiv preprint arxiv …, 2024 - arxiv.org
Video Large Language Models (Vid-LLMs) have made remarkable advancements in
comprehending video content for QA dialogue. However, they struggle to extend this visual …

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

J Zhang, K Wang, S Wang, M Li, H Liu, S Wei… - arxiv preprint arxiv …, 2024 - arxiv.org
A practical navigation agent must be capable of handling a wide range of interaction
demands, such as following instructions, searching objects, answering questions, tracking …

MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning

Z Gan, Y Lu, D Zhang, H Li, C Liu, J Liu, J Liu… - arxiv preprint arxiv …, 2024 - arxiv.org
In recent years, multimodal benchmarks for general domains have guided the rapid
development of multimodal models on general tasks. However, the financial field has its …