Reasoning about actions over visual and linguistic modalities: A survey

SK Sampat, M Patel, S Das, Y Yang, C Baral - arXiv preprint arXiv …, 2022 - arxiv.org
'Actions' play a vital role in how humans interact with the world and enable them to achieve
desired goals. As a result, most common sense (CS) knowledge for humans revolves …

Video2Commonsense: Generating commonsense descriptions to enrich video captioning

Z Fang, T Gokhale, P Banerjee, C Baral… - arXiv preprint arXiv …, 2020 - arxiv.org
Captioning is a crucial and challenging task for video understanding. In videos that involve
active agents such as humans, the agent's actions can bring about myriad changes in the …

CRIPP-VQA: Counterfactual reasoning about implicit physical properties via video question answering

M Patel, T Gokhale, C Baral, Y Yang - arXiv preprint arXiv:2211.03779, 2022 - arxiv.org
Videos often capture objects, their visible properties, their motion, and the interactions
between different objects. Objects also have physical properties such as mass, which the …

Neural constraint satisfaction: Hierarchical abstraction for combinatorial generalization in object rearrangement

M Chang, AL Dayan, F Meier, TL Griffiths… - arXiv preprint arXiv …, 2023 - arxiv.org
Object rearrangement is a challenge for embodied agents because solving these tasks
requires generalizing across a combinatorially large set of configurations of entities and their …

Hierarchical abstraction for combinatorial generalization in object rearrangement

M Chang, AL Dayan, F Meier, TL Griffiths… - … 2022 Workshop on …, 2022 - openreview.net
Object rearrangement is a challenge for embodied agents because solving these tasks
requires generalizing across a combinatorially large set of underlying entities that take the …

ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions

SK Sampat, Y Yang, C Baral - arXiv preprint arXiv:2410.13662, 2024 - arxiv.org
Humans observe various actions being performed by other humans (physically or in
videos/images) and can draw a wide range of inferences about it beyond what they can …