Real-world robot applications of foundation models: A review

K Kawaharazuka, T Matsushima… - Advanced …, 2024 - Taylor & Francis
Recent developments in foundation models, like Large Language Models (LLMs) and Vision-
Language Models (VLMs), trained on extensive data, facilitate flexible application across …

Autoregressive image generation without vector quantization

T Li, Y Tian, H Li, M Deng, K He - Advances in Neural …, 2025 - proceedings.neurips.cc
Conventional wisdom holds that autoregressive models for image generation are typically
accompanied by vector-quantized tokens. We observe that while a discrete-valued space …

MOKA: Open-vocabulary robotic manipulation through mark-based visual prompting

F Liu, K Fang, P Abbeel, S Levine - First Workshop on Vision …, 2024 - openreview.net
Open-vocabulary generalization requires robotic systems to perform tasks involving complex
and diverse environments and task goals. While the recent advances in vision language …

Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation

Z Fu, TZ Zhao, C Finn - arXiv preprint arXiv:2401.02117, 2024 - arxiv.org
Imitation learning from human demonstrations has shown impressive performance in
robotics. However, most results focus on table-top manipulation, lacking the mobility and …

Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers

L Wang, X Chen, J Zhao, K He - arXiv preprint arXiv:2409.20537, 2024 - arxiv.org
One of the roadblocks for training generalist robotic models today is heterogeneity. Previous
robot learning methods often collect data to train with one specific embodiment for one task …

Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation

R Doshi, H Walke, O Mees, S Dasari… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern machine learning systems rely on large datasets to attain broad generalization, and
this often poses a challenge in robot learning, where each robotic platform and task might …

GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation

CL Cheang, G Chen, Y Jing, T Kong, H Li, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
We present GR-2, a state-of-the-art generalist robot agent for versatile and generalizable
robot manipulation. GR-2 is first pre-trained on a vast number of Internet videos to capture …

π0: A Vision-Language-Action Flow Model for General Robot Control

K Black, N Brown, D Driess, A Esmail, M Equi… - arXiv preprint arXiv …, 2024 - arxiv.org
Robot learning holds tremendous promise to unlock the full potential of flexible, general, and
dexterous robot systems, as well as to address some of the deepest questions in artificial …

TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation

J Wen, Y Zhu, J Li, M Zhu, K Wu, Z Xu, N Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor
control and instruction comprehension through end-to-end learning processes. However …

QueST: Self-supervised skill abstractions for learning continuous control

A Mete, H Xue, A Wilcox, Y Chen… - Advances in Neural …, 2025 - proceedings.neurips.cc
Generalization capability, or rather a lack thereof, is one of the most important unsolved
problems in the field of robot learning, and while several large-scale efforts have set out to …