Affordances from human videos as a versatile representation for robotics

S Bahl, R Mendonca, L Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Building a robot that can understand and learn to interact by watching humans has inspired
several vision problems. However, despite some successful results on static datasets, it …

Human activity recognition (HAR) using deep learning: Review, methodologies, progress and future research directions

P Kumar, S Chauhan, LK Awasthi - Archives of Computational Methods in …, 2024 - Springer
Human activity recognition is essential in many domains, including the medical and smart
home sectors. We conduct a comprehensive survey of deep learning approaches and the current state …

Retrospectives on the Embodied AI Workshop

M Deitke, D Batra, Y Bisk, T Campari, AX Chang… - arXiv preprint arXiv …, 2022 - arxiv.org
We present a retrospective on the state of Embodied AI research. Our analysis focuses on
13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are …

Egocentric audio-visual object localization

C Huang, Y Tian, A Kumar… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Humans naturally perceive surrounding scenes by unifying sound and sight in a first-person
view. Likewise, machines can approach human-like intelligence by learning with …

Self-supervised visual learning from interactions with objects

A Aubret, C Teulière, J Triesch - European Conference on Computer …, 2024 - Springer
Self-supervised learning (SSL) has revolutionized visual representation learning, but has
not achieved the robustness of human vision. A reason for this could be that SSL does not …

Hyperbolic audio-visual zero-shot learning

J Hong, Z Hayder, J Han, P Fang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Audio-visual zero-shot learning aims to classify samples consisting of a pair of
corresponding audio and video sequences from classes that are not present during training …

SoundingActions: Learning how actions sound from narrated egocentric videos

C Chen, K Ashutosh, R Girdhar… - Proceedings of the …, 2024 - openaccess.thecvf.com
We propose a novel self-supervised embedding to learn how actions sound from narrated
in-the-wild egocentric videos. Whereas existing methods rely on curated data with known …

HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition

L Sun, Z Lian, B Liu, J Tao - Information Fusion, 2024 - Elsevier
Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in
recent years for its critical role in creating emotion-aware intelligent machines. Previous …

Multi-task learning of object states and state-modifying actions from web videos

T Souček, JB Alayrac, A Miech, I Laptev… - IEEE Transactions on …, 2024 - computer.org
We aim to learn to temporally localize object state changes and the corresponding
state-modifying actions by observing people interacting with objects in long uncurated web …

Interaction region visual transformer for egocentric action anticipation

D Roy, R Rajendiran… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Human-object interaction (HOI) and temporal dynamics along the motion paths are the most
important visual cues for egocentric action anticipation. In particular, interaction regions …