Lvbench: An extreme long video understanding benchmark

W Wang, Z He, W Hong, Y Cheng, X Zhang, J Qi… - arxiv preprint arxiv …, 2024 - arxiv.org

Number it: Temporal grounding videos like flipping manga

Y Wu, X Hu, Y Sun, Y Zhou, W Zhu, F Rao… - arxiv preprint arxiv …, 2024 - arxiv.org
Video Large Language Models (Vid-LLMs) have made remarkable advancements in
comprehending video content for QA dialogue. However, they struggle to extend this visual …

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

J Zhang, K Wang, S Wang, M Li, H Liu, S Wei… - arxiv preprint arxiv …, 2024 - arxiv.org
A practical navigation agent must be capable of handling a wide range of interaction
demands, such as following instructions, searching objects, answering questions, tracking …

MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning

Z Gan, Y Lu, D Zhang, H Li, C Liu, J Liu, J Liu… - arxiv preprint arxiv …, 2024 - arxiv.org
In recent years, multimodal benchmarks for general domains have guided the rapid
development of multimodal models on general tasks. However, the financial field has its …