Real-world robot applications of foundation models: A review

K Kawaharazuka, T Matsushima… - Advanced …, 2024 - Taylor & Francis
Recent developments in foundation models, like Large Language Models (LLMs) and Vision-
Language Models (VLMs), trained on extensive data, facilitate flexible application across …

Embodied navigation with multi-modal information: A survey from tasks to methodology

Y Wu, P Zhang, M Gu, J Zheng, X Bai - Information Fusion, 2024 - Elsevier
Embodied AI aims to create agents that complete complex tasks by interacting with the
environment. A key problem in this field is embodied navigation, which understands multi …

ShapeLLM: Universal 3D object understanding for embodied interaction

Z Qi, R Dong, S Zhang, H Geng, C Han, Z Ge… - … on Computer Vision, 2024 - Springer
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM)
designed for embodied interaction, exploring a universal 3D object understanding with 3D …

Multi3DRefer: Grounding text description to multiple 3D objects

Y Zhang, ZM Gong, AX Chang - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We introduce the task of localizing a flexible number of objects in real-world 3D scenes
using natural language descriptions. Existing 3D visual grounding tasks focus on localizing …

OK-Robot: What really matters in integrating open-knowledge models for robotics

P Liu, Y Orru, J Vakil, C Paxton, NMM Shafiullah… - arXiv preprint arXiv …, 2024 - arxiv.org
Remarkable progress has been made in recent years in the fields of vision, language, and
robotics. We now have vision models capable of recognizing objects based on language …

TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation

J Wen, Y Zhu, J Li, M Zhu, K Wu, Z Xu, N Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor
control and instruction comprehension through end-to-end learning processes. However …

Vision-and-language navigation today and tomorrow: A survey in the era of foundation models

Y Zhang, Z Ma, J Li, Y Qiao, Z Wang, J Chai… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-and-Language Navigation (VLN) has gained increasing attention over recent years,
and many approaches have emerged to advance its development. The remarkable …

GOAT-Bench: A benchmark for multi-modal lifelong navigation

M Khanna, R Ramrakhya… - Proceedings of the …, 2024 - openaccess.thecvf.com
The Embodied AI community has recently made significant strides in visual navigation tasks,
exploring targets from 3D coordinates, objects, language descriptions, and images. However …

Adaptive mobile manipulation for articulated objects in the open world

H **ong, R Mendonca, K Shaw, D Pathak - arxiv preprint arxiv …, 2024 - arxiv.org
Deploying robots in open-ended unstructured environments such as homes has been a long-
standing research problem. However, robots are often studied only in closed-off lab settings …

PoliFormer: Scaling on-policy RL with transformers results in masterful navigators

KH Zeng, Z Zhang, K Ehsani, R Hendrix… - arXiv preprint arXiv …, 2024 - arxiv.org
We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained
end-to-end with reinforcement learning at scale that generalizes to the real world without …