HumanPlus: Humanoid shadowing and imitation from humans

Z Fu, Q Zhao, Q Wu, G Wetzstein, C Finn - arXiv preprint arXiv:2406.10454, 2024 - arxiv.org
One of the key arguments for building robots that have similar form factors to human beings
is that we can leverage the massive human data for training. Yet, doing so has remained …

Towards generalist robot learning from internet video: A survey

R McCarthy, DCH Tan, D Schmidt, F Acero… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling deep learning to massive, diverse internet data has yielded remarkably general
capabilities in visual and natural language understanding and generation. However, data …

GenHowTo: Learning to generate actions and state transformations from instructional videos

T Souček, D Damen, M Wray… - Proceedings of the …, 2024 - openaccess.thecvf.com
We address the task of generating temporally consistent and physically plausible images of
actions and object state transformations. Given an input image and a text prompt describing …

LLaRA: Supercharging robot learning data for vision-language policy

X Li, C Mata, J Park, K Kahatapitiya, YS Jang… - arXiv preprint arXiv …, 2024 - arxiv.org
LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process
state information as visual-textual prompts and respond with policy decisions in text. We …

Uni-O4: Unifying online and offline deep reinforcement learning with multi-step on-policy optimization

K Lei, Z He, C Lu, K Hu, Y Gao, H Xu - arXiv preprint arXiv:2311.03351, 2023 - arxiv.org
Combining offline and online reinforcement learning (RL) is crucial for efficient and safe
learning. However, previous approaches treat offline and online learning as separate …

Spatiotemporal predictive pre-training for robotic motor control

J Yang, B Liu, J Fu, B Pan, G Wu, L Wang - arXiv preprint arXiv …, 2024 - arxiv.org
Robotic motor control necessitates the ability to predict the dynamics of environments and
interaction objects. However, advanced self-supervised pre-trained visual representations in …

OSCaR: Object state captioning and state change representation

N Nguyen, J Bi, A Vosoughi, Y Tian, P Fazli… - arXiv preprint arXiv …, 2024 - arxiv.org
The capability of intelligent models to extrapolate and comprehend changes in object states
is a crucial yet demanding aspect of AI research, particularly through the lens of human …

Learning multi-step manipulation tasks from a single human demonstration

D Guo - arXiv preprint arXiv:2312.15346, 2023 - arxiv.org
Learning from human demonstrations has exhibited remarkable achievements in robot
manipulation. However, the challenge remains to develop a robot system that matches …

Towards empowerment gain through causal structure learning in model-based RL

H Cao, F Feng, M Fang, S Dong, T Yang, J Huo… - arXiv preprint arXiv …, 2025 - arxiv.org
In Model-Based Reinforcement Learning (MBRL), incorporating causal structures into
dynamics models provides agents with a structured understanding of the environments …

Grounding Video Models to Actions through Goal Conditioned Exploration

Y Luo, Y Du - arXiv preprint arXiv:2411.07223, 2024 - arxiv.org
Large video models, pretrained on massive amounts of Internet video, provide a rich source
of physical knowledge about the dynamics and motions of objects and tasks. However …