Real-world robot applications of foundation models: A review
Recent developments in foundation models, like Large Language Models (LLMs) and Vision-Language Models (VLMs), trained on extensive data, facilitate flexible application across …
Autoregressive image generation without vector quantization
Conventional wisdom holds that autoregressive models for image generation are typically accompanied by vector-quantized tokens. We observe that while a discrete-valued space …
MOKA: Open-vocabulary robotic manipulation through mark-based visual prompting
Open-vocabulary generalization requires robotic systems to perform tasks involving complex and diverse environments and task goals. While the recent advances in vision language …
Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation
Imitation learning from human demonstrations has shown impressive performance in robotics. However, most results focus on table-top manipulation, lacking the mobility and …
Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers
One of the roadblocks for training generalist robotic models today is heterogeneity. Previous robot learning methods often collect data to train with one specific embodiment for one task …
Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation
Modern machine learning systems rely on large datasets to attain broad generalization, and this often poses a challenge in robot learning, where each robotic platform and task might …
GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation
We present GR-2, a state-of-the-art generalist robot agent for versatile and generalizable robot manipulation. GR-2 is first pre-trained on a vast number of Internet videos to capture …
π0: A Vision-Language-Action Flow Model for General Robot Control
Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial …
TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. However …
QueST: Self-supervised skill abstractions for learning continuous control
Generalization capabilities, or rather a lack thereof, is one of the most important unsolved problems in the field of robot learning, and while several large scale efforts have set out to …