VideoMAE V2: Scaling video masked autoencoders with dual masking
Scale is the primary factor for building a powerful foundation model that can generalize well
to a variety of downstream tasks. However, it is still challenging to train video …
Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
VideoComposer: Compositional video synthesis with motion controllability
The pursuit of controllability as a higher standard of visual content creation has yielded
remarkable progress in customizable image synthesis. However, achieving controllable …
TriDet: Temporal action detection with relative boundary modeling
In this paper, we present TriDet, a one-stage framework for temporal action detection.
Existing methods often suffer from imprecise boundary predictions due to the ambiguous …
MIST: Multi-modal iterative spatial-temporal transformer for long-form video question answering
To build Video Question Answering (VideoQA) systems capable of assisting
humans in daily activities, seeking answers from long-form videos with diverse and complex …
VidChapters-7M: Video chapters at scale
Segmenting untrimmed videos into chapters enables users to quickly navigate to the
information of their interest. This important topic has been understudied due to the lack of …
Proposal-based multiple instance learning for weakly-supervised temporal action localization
Weakly-supervised temporal action localization aims to localize and recognize actions in
untrimmed videos with only video-level category labels during training. Without instance …
DyFADet: Dynamic feature aggregation for temporal action detection
Recently proposed neural network-based Temporal Action Detection (TAD) models are
inherently limited in extracting discriminative representations and modeling action …
Re2TAL: Rewiring pretrained video backbones for reversible temporal action localization
Temporal action localization (TAL) requires long-form reasoning to predict actions of various
durations and complex content. Given limited GPU memory, training TAL end to end (i.e., from …
A simple LLM framework for long-range video question-answering
We present LLoVi, a language-based framework for long-range video question-answering
(LVQA). Unlike prior long-range video understanding methods, which are often costly and …