- Academic Search

Temporal action segmentation: An analysis of modern techniques

G Ding, F Sener, A Yao - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org

Temporal action segmentation (TAS) in videos aims at densely identifying video frames in
minutes-long videos with multiple action classes. As a long-range video understanding task …

Cited by 70

EgoSchema: A diagnostic benchmark for very long-form video language understanding

K Mangalam, R Akshulakov… - Advances in Neural …, 2023 - proceedings.neurips.cc

We introduce EgoSchema, a very long-form video question-answering dataset, and
benchmark to evaluate long video understanding capabilities of modern vision and …

Cited by 164

MovieChat: From dense token to sparse memory for long video understanding

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Recently integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

Cited by 179

Anticipative video transformer

R Girdhar, K Grauman - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com

We propose Anticipative Video Transformer (AVT), an end-to-end attention-based
video modeling architecture that attends to the previously observed video in order to …

Cited by 250

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Cited by 60

Knowing where to focus: Event-aware transformer for video grounding

J Jang, J Park, J Kim, H Kwon… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Recent DETR-based video grounding models have made the model directly predict moment
timestamps without any hand-crafted components, such as a pre-defined proposal or non …

Cited by 51

Equivariant similarity for vision-language foundation models

T Wang, K Lin, L Li, CC Lin, Z Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com

This study explores the concept of equivariance in vision-language foundation models
(VLMs), focusing specifically on the multimodal similarity function that is not only the major …

Cited by 28

STREAMER: Streaming representation learning and event segmentation in a hierarchical manner

R Mounir, S Vijayaraghavan… - Advances in Neural …, 2023 - proceedings.neurips.cc

We present a novel self-supervised approach for hierarchical representation learning and
segmentation of perceptual inputs in a streaming fashion. Our research addresses how to …

Cited by 7

MoVQA: A benchmark of versatile question-answering for long-form movie understanding

H Zhang, Y Liu, L Dong, Y Huang, ZH Ling… - arXiv preprint arXiv …, 2023 - arxiv.org

While several long-form VideoQA datasets have been introduced, the length of both videos
used to curate questions and sub-clips of clues leveraged to answer those questions have …

Cited by 17

NewsNet: A novel dataset for hierarchical temporal segmentation

H Wu, K Chen, H Liu, M Zhuge, B Li… - Proceedings of the …, 2023 - openaccess.thecvf.com

Temporal video segmentation is the get-to-go automatic video analysis, which decomposes
a long-form video into smaller components for the following-up understanding tasks. Recent …

Cited by 8
