VideoMAE V2: Scaling video masked autoencoders with dual masking

L Wang, B Huang, Z Zhao, Z Tong… - Proceedings of the …, 2023 - openaccess.thecvf.com
Scale is the primary factor for building a powerful foundation model that can generalize well to a variety of downstream tasks. However, it is still challenging to train video …

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …

VideoComposer: Compositional video synthesis with motion controllability

X Wang, H Yuan, S Zhang, D Chen… - Advances in …, 2024 - proceedings.neurips.cc
The pursuit of controllability as a higher standard of visual content creation has yielded
remarkable progress in customizable image synthesis. However, achieving controllable …

TriDet: Temporal action detection with relative boundary modeling

D Shi, Y Zhong, Q Cao, L Ma, J Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we present a one-stage framework TriDet for temporal action detection.
Existing methods often suffer from imprecise boundary predictions due to the ambiguous …

MIST: Multi-modal iterative spatial-temporal transformer for long-form video question answering

D Gao, L Zhou, L Ji, L Zhu, Y Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
To build Video Question Answering (VideoQA) systems capable of assisting
humans in daily activities, seeking answers from long-form videos with diverse and complex …

VidChapters-7M: Video chapters at scale

A Yang, A Nagrani, I Laptev, J Sivic… - Advances in Neural …, 2024 - proceedings.neurips.cc
Segmenting untrimmed videos into chapters enables users to quickly navigate to the
information of their interest. This important topic has been understudied due to the lack of …

Proposal-based multiple instance learning for weakly-supervised temporal action localization

H Ren, W Yang, T Zhang… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Weakly-supervised temporal action localization aims to localize and recognize actions in
untrimmed videos with only video-level category labels during training. Without instance …

DyFADet: Dynamic feature aggregation for temporal action detection

L Yang, Z Zheng, Y Han, H Cheng, S Song… - … on Computer Vision, 2024 - Springer
Recently proposed neural network-based Temporal Action Detection (TAD) models are
inherently limited in extracting discriminative representations and modeling action …

Re2TAL: Rewiring pretrained video backbones for reversible temporal action localization

C Zhao, S Liu, K Mangalam… - Proceedings of the …, 2023 - openaccess.thecvf.com
Temporal action localization (TAL) requires long-form reasoning to predict actions of various
durations and complex content. Given limited GPU memory, training TAL end to end (i.e., from …

A simple LLM framework for long-range video question-answering

C Zhang, T Lu, MM Islam, Z Wang, S Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
We present LLoVi, a language-based framework for long-range video question-answering
(LVQA). Unlike prior long-range video understanding methods, which are often costly and …