Learning multi-object dynamics with compositional neural radiance fields
We present a method to learn compositional multi-object dynamics models from image
observations based on implicit object encoders, Neural Radiance Fields (NeRFs), and …
Joint hand motion and interaction hotspots prediction from egocentric videos
We propose to forecast future hand-object interactions given an egocentric video. Instead of
predicting action labels or pixels, we directly predict the hand motion trajectory and the …
SlotFormer: Unsupervised visual dynamics simulation with object-centric models
Understanding dynamics from visual observations is a challenging problem that requires
disentangling individual objects from the scene and learning their interactions. While recent …
Dynamic visual reasoning by learning differentiable physics models from video and language
In this work, we propose a unified framework, called Visual Reasoning with Differentiable
Physics (VRDP), that can jointly learn visual concepts and infer physics models of objects …
Graph inverse reinforcement learning from diverse videos
Research on Inverse Reinforcement Learning (IRL) from third-person videos has
shown encouraging results on removing the need for manual reward design for robotic …
Neural production systems
Visual environments are structured, consisting of distinct objects or entities. These entities
have properties---visible or latent---that determine the manner in which they interact with one …
Visual reinforcement learning with self-supervised 3D representations
A prominent approach to visual Reinforcement Learning (RL) is to learn an internal state
representation using self-supervised methods, which has the potential benefit of improved …
Physion: Evaluating physical prediction from vision in humans and machines
While current vision algorithms excel at many challenging tasks, it is unclear how well they
understand the physical dynamics of real-world environments. Here we introduce Physion, a …
Progressive instance-aware feature learning for compositional action recognition
In order to enable the model to generalize to unseen “action-objects” (compositional action),
previous methods encode multiple pieces of information (i.e., the appearance, position, and …
VDT: General-purpose video diffusion transformers via mask modeling
This work introduces Video Diffusion Transformer (VDT), which pioneers the use of
transformers in diffusion-based video generation. It features transformer blocks with …