Deep reinforcement learning for robotics: A survey of real-world successes

C Tang, B Abbatematteo, J Hu… - Annual Review of …, 2024 - annualreviews.org
Reinforcement learning (RL), particularly its combination with deep neural networks,
referred to as deep RL (DRL), has shown tremendous promise across a wide range of …

Real-world robot applications of foundation models: A review

K Kawaharazuka, T Matsushima… - Advanced …, 2024 - Taylor & Francis
Recent developments in foundation models, like Large Language Models (LLMs) and Vision-
Language Models (VLMs), trained on extensive data, facilitate flexible application across …

Foundation models in robotics: Applications, challenges, and the future

R Firoozi, J Tucker, S Tian… - … Journal of Robotics …, 2023 - journals.sagepub.com
We survey applications of pretrained foundation models in robotics. Traditional deep
learning models in robotics are trained on small datasets tailored for specific tasks, which …

Octo: An open-source generalist robot policy

OM Team, D Ghosh, H Walke, K Pertsch… - arXiv preprint arXiv …, 2024 - arxiv.org
Large policies pretrained on diverse robot datasets have the potential to transform robotic
learning: instead of training new policies from scratch, such generalist robot policies may be …

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

DriveVLM: The convergence of autonomous driving and large vision-language models

X Tian, J Gu, B Li, Y Liu, Y Wang, Z Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
A primary hurdle of autonomous driving in urban environments is understanding complex
and long-tail scenarios, such as challenging road conditions and delicate human behaviors …

Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation

Z Fu, TZ Zhao, C Finn - arXiv preprint arXiv:2401.02117, 2024 - arxiv.org
Imitation learning from human demonstrations has shown impressive performance in
robotics. However, most results focus on table-top manipulation, lacking the mobility and …

MOKA: Open-vocabulary robotic manipulation through mark-based visual prompting

F Liu, K Fang, P Abbeel, S Levine - First Workshop on Vision …, 2024 - openreview.net
Open-vocabulary generalization requires robotic systems to perform tasks involving complex
and diverse environments and task goals. While the recent advances in vision language …

Open-TeleVision: Teleoperation with immersive active visual feedback

X Cheng, J Li, S Yang, G Yang, X Wang - arXiv preprint arXiv:2407.01512, 2024 - arxiv.org
Teleoperation serves as a powerful method for collecting on-robot data essential for robot
learning from demonstrations. The intuitiveness and ease of use of the teleoperation system …

LongVILA: Scaling long-context visual language models for long videos

F Xue, Y Chen, D Li, Q Hu, L Zhu, X Li, Y Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context capability is critical for multi-modal foundation models, especially for long
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …