Aligning cyberspace with the physical world: A comprehensive survey on Embodied AI

Y Liu, W Chen, Y Bai, X Liang, G Li, W Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General
Intelligence (AGI) and serves as a foundation for various applications that bridge cyberspace …

NavGPT-2: Unleashing navigational reasoning capability for large vision-language models

G Zhou, Y Hong, Z Wang, XE Wang, Q Wu - European Conference on …, 2024 - Springer
Capitalizing on the remarkable advancements in Large Language Models (LLMs), there is a
burgeoning initiative to harness LLMs for instruction-following robotic navigation. Such a …

MapGPT: Map-guided prompting with adaptive path planning for vision-and-language navigation

J Chen, B Lin, R Xu, Z Chai, X Liang… - Proceedings of the …, 2024 - aclanthology.org
Embodied agents equipped with GPT as their brain have exhibited extraordinary decision-
making and generalization abilities across various tasks. However, existing zero-shot agents …

Vision-and-language navigation today and tomorrow: A survey in the era of foundation models

Y Zhang, Z Ma, J Li, Y Qiao, Z Wang, J Chai… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-and-Language Navigation (VLN) has gained increasing attention over recent years,
and many approaches have emerged to advance its development. The remarkable …

Large multimodal agents: A survey

J **e, Z Chen, R Zhang, X Wan, G Li - arxiv preprint arxiv:2402.15116, 2024 - arxiv.org
Large language models (LLMs) have achieved superior performance in powering text-
based AI agents, endowing them with decision-making and reasoning abilities akin to …

Understanding World or Predicting Future? A Comprehensive Survey of World Models

J Ding, Y Zhang, Y Shang, Y Zhang, Z Zong… - arXiv preprint arXiv …, 2024 - arxiv.org
The concept of world models has garnered significant attention due to advancements in
multimodal large language models such as GPT-4 and video generation models such as …

MapGPT: Map-guided prompting for unified vision-and-language navigation

J Chen, B Lin, R Xu, Z Chai, X Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
Embodied agents equipped with GPT as their brain have exhibited extraordinary thinking
and decision-making abilities across various tasks. However, existing zero-shot agents for …

Sim-to-real transfer via 3D feature fields for vision-and-language navigation

Z Wang, X Li, J Yang, Y Liu, S Jiang - arXiv preprint arXiv:2406.09798, 2024 - arxiv.org
Vision-and-language navigation (VLN) enables an agent to navigate to a remote location in
a 3D environment by following natural language instructions. In this field, the agent is usually …

Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation

J Chen, B Lin, X Liu, L Ma, X Liang… - arXiv preprint arXiv …, 2024 - arxiv.org
LLM-based agents have demonstrated impressive zero-shot performance on the vision-
language navigation (VLN) task. However, existing LLM-based methods often focus only on …

ThinkBot: Embodied instruction following with thought chain reasoning

G Lu, Z Wang, C Liu, J Lu, Y Tang - arXiv preprint arXiv:2312.07062, 2023 - arxiv.org
Embodied Instruction Following (EIF) requires agents to complete human instructions by
interacting with objects in complicated surrounding environments. Conventional methods directly …