Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Deep learning-based action detection in untrimmed videos: A survey

E Vahdani, Y Tian - IEEE Transactions on Pattern Analysis and …, 2022 - ieeexplore.ieee.org
Understanding human behavior and activity facilitates advancement of numerous real-world
applications, and is critical for video analysis. Despite the progress of action recognition …

Advancing high-resolution video-language representation with large-scale video transcriptions

H Xue, T Hang, Y Zeng, Y Sun, B Liu… - Proceedings of the …, 2022 - openaccess.thecvf.com
We study joint video and language (VL) pre-training to enable cross-modality learning and
benefit plentiful downstream VL tasks. Existing works either extract low-quality video …

TallFormer: Temporal Action Localization with a Long-Memory Transformer

F Cheng, G Bertasius - European Conference on Computer Vision, 2022 - Springer
Most modern approaches in temporal action localization divide this problem into two parts: (i)
short-term feature extraction and (ii) long-range temporal boundary localization. Due to the …

Temporal action detection with structured segment networks

Y Zhao, Y Xiong, L Wang, Z Wu… - Proceedings of the …, 2017 - openaccess.thecvf.com
Detecting actions in untrimmed videos is an important yet challenging task. In this paper, we
present the structured segment network (SSN), a novel framework which models the …

MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment

D Zhang, X Dai, X Wang, YF Wang… - Proceedings of the …, 2019 - openaccess.thecvf.com
This research strives for natural language moment retrieval in long, untrimmed video
streams. The problem is not trivial especially when a video contains multiple moments of …

Weakly-supervised action localization by generative attention modeling

B Shi, Q Dai, Y Mu, J Wang - Proceedings of the IEEE/CVF …, 2020 - openaccess.thecvf.com
Weakly-supervised temporal action localization is a problem of learning an action
localization model with only video-level action labeling available. The general framework …

Exploring denoised cross-video contrast for weakly-supervised temporal action localization

J Li, T Yang, W Ji, J Wang… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Weakly-supervised temporal action localization aims to localize actions in untrimmed videos
with only video-level labels. Most existing methods address this problem with a "localization …

An efficient spatio-temporal pyramid transformer for action detection

Y Weng, Z Pan, M Han, X Chang, B Zhuang - European Conference on …, 2022 - Springer
The task of action detection aims at deducing both the action category and localization of the
start and end moment for each action instance in a long, untrimmed video. While vision …

Top-heavy CapsNets based on spatiotemporal non-local for action recognition

MH Ha - Journal of Computing Theories and Applications, 2024 - dl.futuretechsci.org
To effectively comprehend human actions, we have developed a Deep Neural Network
(DNN) that utilizes inner spatiotemporal non-locality to capture meaningful semantic context …