Prompting visual-language models for efficient video understanding
Image-based visual-language (I-VL) pre-training has shown great success for learning joint
visual-textual representations from large-scale web data, revealing remarkable ability for …
How do you do it? fine-grained action understanding with pseudo-adverbs
We aim to understand how actions are performed and identify subtle differences, such as
'fold firmly' vs. 'fold gently'. To this end, we propose a method which recognizes adverbs …
Alignment-uniformity aware representation learning for zero-shot video classification
Most methods tackle zero-shot video classification by aligning visual-semantic
representations within seen classes, which limits generalization to unseen classes. To …
Actionhub: a large-scale action video description dataset for zero-shot action recognition
Zero-shot action recognition (ZSAR) aims to learn an alignment model between videos and
class descriptions of seen actions that is transferable to unseen actions. The text queries …
Tell me what you see: A zero-shot action recognition method based on natural language descriptions
This paper presents a novel approach to Zero-Shot Action Recognition. Recent works have
explored the detection and classification of objects to obtain semantic information from …
Deconfounding causal inference for zero-shot action recognition
Zero-shot action recognition (ZSAR) aims to recognize unseen action categories in the test
set without corresponding training examples. Most existing zero-shot methods follow the …
Routing evidence for unseen actions in video moment retrieval
Video moment retrieval (VMR) is a cutting-edge vision-language task that locates a segment in
a video according to a query. Though existing methods have achieved significant performance …
Bi-calibration networks for weakly-supervised video representation learning
The leverage of large volumes of web videos paired with the query (short phrase for
searching the video) or surrounding text (long textual description, e.g., video title) offers an …
Video Attribute Prototype Network: A New Perspective for Zero-Shot Video Classification
Video attributes, which leverage video contents to instantiate class semantics, play a critical
role in diversifying semantics in zero-shot video classification, thereby facilitating semantic …
Zero-shot action recognition from diverse object-scene compositions
This paper investigates the problem of zero-shot action recognition, in the setting where no
training videos with seen actions are available. For this challenging scenario, the current …