Video transformers: A survey

J Selva, AS Johansen, S Escalera… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer models have shown great success handling long-range interactions, making
them a promising tool for modeling video. However, they lack inductive biases and scale …

Multi-scale video anomaly detection by multi-grained spatio-temporal representation learning

M Zhang, J Wang, Q Qi, H Sun… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent progress in video anomaly detection suggests that the features of
appearance and motion play crucial roles in distinguishing abnormal patterns from normal …

Efficient video action detection with token dropout and context refinement

L Chen, Z Tong, Y Song, G Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Streaming video clips with large-scale video tokens impede vision transformers (ViTs) for
efficient recognition, especially in video action detection where sufficient spatiotemporal …

Movqa: A benchmark of versatile question-answering for long-form movie understanding

H Zhang, Y Liu, L Dong, Y Huang, ZH Ling… - arxiv preprint arxiv …, 2023 - arxiv.org
While several long-form VideoQA datasets have been introduced, the lengths of both the videos
used to curate questions and the sub-clips of clues leveraged to answer those questions have …

Hig: Hierarchical interlacement graph approach to scene graph generation in video understanding

TT Nguyen, P Nguyen, K Luu - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Visual interactivity understanding within visual scenes presents a significant challenge in
computer vision. Existing methods focus on complex interactivities while leveraging a simple …

A video is worth 4096 tokens: Verbalize videos to understand them in zero shot

A Bhattacharya, YK Singla, B Krishnamurthy… - arxiv preprint arxiv …, 2023 - arxiv.org
Multimedia content, such as advertisements and story videos, exhibit a rich blend of
creativity and multiple modalities. They incorporate elements like text, visuals, audio, and …

Long-range multimodal pretraining for movie understanding

DM Argaw, JY Lee, M Woodson… - Proceedings of the …, 2023 - openaccess.thecvf.com
Learning computer vision models from (and for) movies has a long-standing history. While
great progress has been attained, there is still a need for a pretrained multimodal model that …

Grounded video situation recognition

Z Khan, CV Jawahar… - Advances in Neural …, 2022 - proceedings.neurips.cc
Dense video understanding requires answering several questions such as who is doing
what to whom, with what, how, why, and where. Recently, Video Situation Recognition …

Video event extraction with multi-view interaction knowledge distillation

K Wei, R Du, L **, J Liu, J Yin, L Zhang, J Liu… - Proceedings of the …, 2024 - ojs.aaai.org
Video event extraction (VEE) aims to extract key events and generate the event arguments
for their semantic roles from the video. Although promising results have been achieved by …

Clipsitu: Effectively leveraging clip for conditional predictions in situation recognition

D Roy, D Verma, B Fernando - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Situation Recognition is the task of generating a structured summary of what is happening in
an image using an activity verb and the semantic roles played by actors and objects. In this …