Aligning cyber space with physical world: A comprehensive survey on embodied AI
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General
Intelligence (AGI) and serves as a foundation for various applications that bridge cyberspace …
Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …
A survey on benchmarks of multimodal large language models
Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both
academia and industry due to their remarkable performance in various applications such as …
A survey on evaluation of multimodal large language models
J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …
HourVideo: 1-hour video-language understanding
We present HourVideo, a benchmark dataset for hour-long video-language understanding.
Our dataset consists of a novel task suite comprising summarization, perception (recall …
Coarse correspondence elicit 3D spacetime understanding in multimodal language model
Multimodal language models (MLLMs) are increasingly being implemented in real-world
environments, necessitating their ability to interpret 3D spaces and comprehend temporal …
Continual LLaVA: Continual instruction tuning in large vision-language models
Instruction tuning constitutes a prevalent technique for tailoring Large Vision Language
Models (LVLMs) to meet individual task requirements. To date, most of the existing …
Does Spatial Cognition Emerge in Frontier Models?
Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in
frontier models. Our benchmark builds on decades of research in cognitive science. It …
Multi-modal situated reasoning in 3D scenes
Situation awareness is essential for understanding and reasoning about 3D scenes in
embodied AI agents. However, existing datasets and benchmarks for situated understanding …
VLM-Grounder: A VLM agent for zero-shot 3D visual grounding
3D visual grounding is crucial for robots, requiring integration of natural language and 3D
scene understanding. Traditional methods depending on supervised learning with 3D point …