Temporal action segmentation: An analysis of modern techniques

G Ding, F Sener, A Yao - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Temporal action segmentation (TAS) in videos aims to densely label video frames in
minutes-long videos containing multiple action classes. As a long-range video understanding task …

EgoSchema: A diagnostic benchmark for very long-form video language understanding

K Mangalam, R Akshulakov… - Advances in Neural …, 2023 - proceedings.neurips.cc
We introduce EgoSchema, a very long-form video question-answering dataset and
benchmark to evaluate the long-video understanding capabilities of modern vision and …

MovieChat: From dense token to sparse memory for long video understanding

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recently, integrating video foundation models and large language models to build video
understanding systems has made it possible to overcome the limitations of specific pre-defined vision tasks. Yet …

Anticipative video transformer

R Girdhar, K Grauman - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
We propose Anticipative Video Transformer (AVT), an end-to-end attention-based
video modeling architecture that attends to the previously observed video in order to …

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Knowing where to focus: Event-aware transformer for video grounding

J Jang, J Park, J Kim, H Kwon… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Recent DETR-based video grounding models enable the model to directly predict moment
timestamps without any hand-crafted components, such as a pre-defined proposal or non …

Equivariant similarity for vision-language foundation models

T Wang, K Lin, L Li, CC Lin, Z Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com
This study explores the concept of equivariance in vision-language foundation models
(VLMs), focusing specifically on the multimodal similarity function that is not only the major …

STREAMER: Streaming representation learning and event segmentation in a hierarchical manner

R Mounir, S Vijayaraghavan… - Advances in Neural …, 2023 - proceedings.neurips.cc
We present a novel self-supervised approach for hierarchical representation learning and
segmentation of perceptual inputs in a streaming fashion. Our research addresses how to …

MoVQA: A benchmark of versatile question-answering for long-form movie understanding

H Zhang, Y Liu, L Dong, Y Huang, ZH Ling… - arXiv preprint arXiv …, 2023 - arxiv.org
While several long-form VideoQA datasets have been introduced, the lengths of both the videos
used to curate questions and the sub-clips of clues leveraged to answer those questions have …

NewsNet: A novel dataset for hierarchical temporal segmentation

H Wu, K Chen, H Liu, M Zhuge, B Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Temporal video segmentation is the go-to first step in automatic video analysis, decomposing
a long-form video into smaller components for follow-up understanding tasks. Recent …