Deep reinforcement learning for robotics: A survey of real-world successes

C Tang, B Abbatematteo, J Hu… - Annual Review of …, 2024 - annualreviews.org
Reinforcement learning (RL), particularly its combination with deep neural networks,
referred to as deep RL (DRL), has shown tremendous promise across a wide range of …

Real-world robot applications of foundation models: A review

K Kawaharazuka, T Matsushima… - Advanced …, 2024 - Taylor & Francis
Recent developments in foundation models, like Large Language Models (LLMs) and Vision-
Language Models (VLMs), trained on extensive data, facilitate flexible application across …

Foundation models in robotics: Applications, challenges, and the future

R Firoozi, J Tucker, S Tian… - … Journal of Robotics …, 2023 - journals.sagepub.com
We survey applications of pretrained foundation models in robotics. Traditional deep
learning models in robotics are trained on small datasets tailored for specific tasks, which …

Octo: An open-source generalist robot policy

OM Team, D Ghosh, H Walke, K Pertsch… - arXiv preprint arXiv …, 2024 - arxiv.org
Large policies pretrained on diverse robot datasets have the potential to transform robotic
learning: instead of training new policies from scratch, such generalist robot policies may be …

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

DriveVLM: The convergence of autonomous driving and large vision-language models

X Tian, J Gu, B Li, Y Liu, Y Wang, Z Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
A primary hurdle of autonomous driving in urban environments is understanding complex
and long-tail scenarios, such as challenging road conditions and delicate human behaviors …

Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation

Z Fu, TZ Zhao, C Finn - arXiv preprint arXiv:2401.02117, 2024 - arxiv.org
Imitation learning from human demonstrations has shown impressive performance in
robotics. However, most results focus on table-top manipulation, lacking the mobility and …

MOKA: Open-vocabulary robotic manipulation through mark-based visual prompting

F Liu, K Fang, P Abbeel, S Levine - First Workshop on Vision …, 2024 - openreview.net
Open-vocabulary generalization requires robotic systems to perform tasks involving complex
and diverse environments and task goals. While the recent advances in vision language …

Open-TeleVision: Teleoperation with immersive active visual feedback

X Cheng, J Li, S Yang, G Yang, X Wang - arXiv preprint arXiv:2407.01512, 2024 - arxiv.org
Teleoperation serves as a powerful method for collecting on-robot data essential for robot
learning from demonstrations. The intuitiveness and ease of use of the teleoperation system …

LongVILA: Scaling long-context visual language models for long videos

F Xue, Y Chen, D Li, Q Hu, L Zhu, X Li, Y Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context capability is critical for multi-modal foundation models, especially for long
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …