Aligning cyber space with physical world: A comprehensive survey on embodied AI

Y Liu, W Chen, Y Bai, X Liang, G Li, W Gao… - arXiv preprint arXiv…, 2024 - arxiv.org
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General
Intelligence (AGI) and serves as a foundation for various applications that bridge cyberspace …

Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv…, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

A survey on benchmarks of multimodal large language models

J Li, W Lu, H Fei, M Luo, M Dai, M Xia, Y Xi… - arXiv preprint arXiv…, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both
academia and industry due to their remarkable performance in various applications such as …

A survey on evaluation of multimodal large language models

J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …

HourVideo: 1-hour video-language understanding

K Chandrasegaran, A Gupta, LM Hadzic, T Kota… - arXiv preprint arXiv…, 2024 - arxiv.org
We present HourVideo, a benchmark dataset for hour-long video-language understanding.
Our dataset consists of a novel task suite comprising summarization, perception (recall …

Coarse correspondences elicit 3D spacetime understanding in multimodal language models

B Liu, Y Dong, Y Wang, Y Rao, Y Tang, WC Ma… - arXiv preprint arXiv…, 2024 - arxiv.org
Multimodal language models (MLLMs) are increasingly being implemented in real-world
environments, necessitating their ability to interpret 3D spaces and comprehend temporal …

Continual LLaVA: Continual instruction tuning in large vision-language models

M Cao, Y Liu, Y Liu, T Wang, J Dong, H Ding… - arXiv preprint arXiv…, 2024 - arxiv.org
Instruction tuning constitutes a prevalent technique for tailoring Large Vision Language
Models (LVLMs) to meet individual task requirements. To date, most of the existing …

Does Spatial Cognition Emerge in Frontier Models?

SK Ramakrishnan, E Wijmans, P Kraehenbuehl… - arXiv preprint arXiv…, 2024 - arxiv.org
Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in
frontier models. Our benchmark builds on decades of research in cognitive science. It …

Multi-modal situated reasoning in 3D scenes

X Linghu, J Huang, X Niu, X Ma, B Jia… - arXiv preprint arXiv…, 2024 - arxiv.org
Situation awareness is essential for understanding and reasoning about 3D scenes in
embodied AI agents. However, existing datasets and benchmarks for situated understanding …

VLM-Grounder: A VLM agent for zero-shot 3D visual grounding

R Xu, Z Huang, T Wang, Y Chen, J Pang… - arXiv preprint arXiv…, 2024 - arxiv.org
3D visual grounding is crucial for robots, requiring integration of natural language and 3D
scene understanding. Traditional methods depending on supervised learning with 3D point …