Aligning cyber space with physical world: A comprehensive survey on embodied AI

Y Liu, W Chen, Y Bai, X Liang, G Li, W Gao… - arXiv preprint arXiv…, 2024 - arxiv.org
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General
Intelligence (AGI) and serves as a foundation for various applications that bridge cyberspace …

Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv…, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

A survey on benchmarks of multimodal large language models

J Li, W Lu, H Fei, M Luo, M Dai, M Xia, Y Xi… - arXiv preprint arXiv…, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both
academia and industry due to their remarkable performance in various applications such as …

A survey on evaluation of multimodal large language models

J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …

HourVideo: 1-hour video-language understanding

K Chandrasegaran, A Gupta, LM Hadzic, T Kota… - arXiv preprint arXiv…, 2024 - arxiv.org
We present HourVideo, a benchmark dataset for hour-long video-language understanding.
Our dataset consists of a novel task suite comprising summarization, perception (recall …

Coarse correspondences elicit 3D spacetime understanding in multimodal language models

B Liu, Y Dong, Y Wang, Y Rao, Y Tang, WC Ma… - arXiv preprint arXiv…, 2024 - arxiv.org
Multimodal language models (MLLMs) are increasingly being implemented in real-world
environments, necessitating their ability to interpret 3D spaces and comprehend temporal …

Continual LLaVA: Continual instruction tuning in large vision-language models

M Cao, Y Liu, Y Liu, T Wang, J Dong, H Ding… - arXiv preprint arXiv…, 2024 - arxiv.org
Instruction tuning constitutes a prevalent technique for tailoring Large Vision Language
Models (LVLMs) to meet individual task requirements. To date, most of the existing …

Does Spatial Cognition Emerge in Frontier Models?

SK Ramakrishnan, E Wijmans, P Kraehenbuehl… - arXiv preprint arXiv…, 2024 - arxiv.org
Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in
frontier models. Our benchmark builds on decades of research in cognitive science. It …

Multi-modal situated reasoning in 3D scenes

X Linghu, J Huang, X Niu, X Ma, B Jia… - arXiv preprint arXiv…, 2024 - arxiv.org
Situation awareness is essential for understanding and reasoning about 3D scenes in
embodied AI agents. However, existing datasets and benchmarks for situated understanding …

VLM-Grounder: A VLM agent for zero-shot 3D visual grounding

R Xu, Z Huang, T Wang, Y Chen, J Pang… - arXiv preprint arXiv…, 2024 - arxiv.org
3D visual grounding is crucial for robots, requiring integration of natural language and 3D
scene understanding. Traditional methods depending on supervised learning with 3D point …