Video action transformer network
We introduce the Action Transformer model for recognizing and localizing human
actions in video clips. We repurpose a Transformer-style architecture to aggregate features …
The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision
We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns visual
concepts, words, and semantic parsing of sentences without explicit supervision on any of …
Long-term feature banks for detailed video understanding
To understand the world, we humans constantly need to relate the present to the past, and
put events in context. In this paper, we enable existing video models to do the same. We …
Compositional chain-of-thought prompting for large multimodal models
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …
Action genome: Actions as compositions of spatio-temporal scene graphs
Action recognition has typically treated actions and activities as monolithic events that occur
in videos. However, there is evidence from Cognitive Science and Neuroscience that people …
Understanding human hands in contact at internet scale
Hands are the central means by which humans manipulate their world and being able to
reliably extract hand state information from Internet videos of humans engaged in their …
EPIC-KITCHENS VISOR benchmark: Video segmentations and object relations
We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for
segmenting hands and active objects in egocentric video. VISOR annotates videos from …
Large-scale weakly-supervised pre-training for video action recognition
Current fully-supervised video datasets consist of only a few hundred thousand videos and
fewer than a thousand domain-specific labels. This hinders the progress towards advanced …
What makes training multi-modal classification networks hard?
Consider end-to-end training of a multi-modal vs. a uni-modal network on a task with
multiple input modalities: the multi-modal network receives more information, so it should …
Towards long-form video understanding
Our world offers a never-ending stream of visual stimuli, yet today's vision systems only
accurately recognize patterns within a few seconds. These systems understand the present …