RoboCasa: Large-scale simulation of everyday tasks for generalist robots

S Nasiriany, A Maddukuri, L Zhang, A Parikh… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in Artificial Intelligence (AI) have largely been propelled by scaling. In
Robotics, scaling is hindered by the lack of access to massive robot datasets. We advocate …

Benchmark evaluations, applications, and challenges of large vision language models: A survey

Z Li, X Wu, H Du, H Nghiem, G Shi - arXiv preprint arXiv:2501.02189, 2025 - arxiv.org
Multimodal Vision Language Models (VLMs) have emerged as a transformative technology
at the intersection of computer vision and natural language processing, enabling machines …

Pushing the limits of cross-embodiment learning for manipulation and navigation

J Yang, C Glossop, A Bhorkar, D Shah… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent years in robotics and imitation learning have seen remarkable progress in training
large-scale foundation models by leveraging data across a multitude of embodiments. The …

The Colosseum: A benchmark for evaluating generalization for robotic manipulation

W Pumacay, I Singh, J Duan, R Krishna… - arXiv preprint arXiv …, 2024 - arxiv.org
To realize effective large-scale, real-world robotic applications, we must evaluate how well
our robot policies adapt to changes in environmental conditions. Unfortunately, a majority of …

Towards efficient LLM grounding for embodied multi-agent collaboration

Y Zhang, S Yang, C Bai, F Wu, X Li, Z Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Grounding the reasoning ability of large language models (LLMs) for embodied tasks is
challenging due to the complexity of the physical world. In particular, LLM planning for multi …

Policy adaptation via language optimization: Decomposing tasks for few-shot imitation

V Myers, BC Zheng, O Mees, S Levine… - arXiv preprint arXiv …, 2024 - arxiv.org
Learned language-conditioned robot policies often struggle to effectively adapt to new real-
world tasks even when pre-trained across a diverse set of instructions. We propose a novel …

Thinking in space: How multimodal large language models see, remember, and recall spaces

J Yang, S Yang, AW Gupta, R Han, L Fei-Fei… - arXiv preprint arXiv …, 2024 - arxiv.org
Humans possess the visual-spatial intelligence to remember spaces from sequential visual
observations. However, can Multimodal Large Language Models (MLLMs) trained on million …

CogACT: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation

Q Li, Y Liang, Z Wang, L Luo, X Chen, M Liao… - arXiv preprint arXiv …, 2024 - arxiv.org
The advancement of large Vision-Language-Action (VLA) models has significantly improved
robotic manipulation in terms of language-guided task execution and generalization to …

A survey of robotic language grounding: Tradeoffs between symbols and embeddings

V Cohen, JX Liu, R Mooney, S Tellex… - arXiv preprint arXiv …, 2024 - arxiv.org
With large language models, robots can understand language more flexibly and more
capably than ever before. This survey reviews and situates recent literature into a spectrum …

AnyCar to Anywhere: Learning universal dynamics model for agile and adaptive mobility

W Xiao, H Xue, T Tao, D Kalaria, JM Dolan… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent works in the robot learning community have successfully introduced generalist
models capable of controlling various robot embodiments across a wide range of tasks, such …