Temporal action segmentation: An analysis of modern techniques
Temporal action segmentation (TAS) in videos aims at densely identifying video frames in
minutes-long videos with multiple action classes. As a long-range video understanding task …
EgoSchema: A diagnostic benchmark for very long-form video language understanding
We introduce EgoSchema, a very long-form video question-answering dataset, and
benchmark to evaluate long video understanding capabilities of modern vision and …
MovieChat: From dense token to sparse memory for long video understanding
Recently integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …
Anticipative video transformer
We propose Anticipative Video Transformer (AVT), an end-to-end attention-based
video modeling architecture that attends to the previously observed video in order to …
Video understanding with large language models: A survey
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …
Knowing where to focus: Event-aware transformer for video grounding
Recent DETR-based video grounding models have made the model directly predict moment
timestamps without any hand-crafted components, such as a pre-defined proposal or non …
Equivariant similarity for vision-language foundation models
This study explores the concept of equivariance in vision-language foundation models
(VLMs), focusing specifically on the multimodal similarity function that is not only the major …
STREAMER: Streaming representation learning and event segmentation in a hierarchical manner
We present a novel self-supervised approach for hierarchical representation learning and
segmentation of perceptual inputs in a streaming fashion. Our research addresses how to …
MoVQA: A benchmark of versatile question-answering for long-form movie understanding
While several long-form VideoQA datasets have been introduced, the length of both videos
used to curate questions and sub-clips of clues leveraged to answer those questions have …
NewsNet: A novel dataset for hierarchical temporal segmentation
Temporal video segmentation is the get-to-go automatic video analysis, which decomposes
a long-form video into smaller components for the following-up understanding tasks. Recent …