LVBench: An Extreme Long Video Understanding Benchmark
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks
A practical navigation agent must be capable of handling a wide range of interaction
demands, such as following instructions, searching objects, answering questions, tracking …
MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
In recent years, multimodal benchmarks for general domains have guided the rapid
development of multimodal models on general tasks. However, the financial field has its …