Affordances from human videos as a versatile representation for robotics
Building a robot that can understand and learn to interact by watching humans has inspired
several vision problems. However, despite some successful results on static datasets, it …
Human activity recognition (HAR) using deep learning: Review, methodologies, progress and future research directions
Human activity recognition is essential in many domains, including the medical and smart
home sectors. We conduct a comprehensive survey of deep learning approaches, covering the current state …
Retrospectives on the Embodied AI Workshop
We present a retrospective on the state of Embodied AI research. Our analysis focuses on
13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are …
Egocentric audio-visual object localization
Humans naturally perceive surrounding scenes by unifying sound and sight in a first-person
view. Likewise, machines are advancing toward human intelligence by learning with …
Self-supervised visual learning from interactions with objects
Self-supervised learning (SSL) has revolutionized visual representation learning, but has
not achieved the robustness of human vision. A reason for this could be that SSL does not …
Hyperbolic audio-visual zero-shot learning
Audio-visual zero-shot learning aims to classify samples consisting of a pair of
corresponding audio and video sequences from classes that are not present during training …
Soundingactions: Learning how actions sound from narrated egocentric videos
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known …
HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition
Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in
recent years for its critical role in creating emotion-aware intelligent machines. Previous …
Multi-task learning of object states and state-modifying actions from web videos
We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long uncurated web …
Interaction region visual transformer for egocentric action anticipation
Human-object interaction (HOI) and temporal dynamics along the motion paths are the most
important visual cues for egocentric action anticipation. In particular, interaction regions …