TimeChat: A time-sensitive multimodal large language model for long video understanding

S Ren, L Yao, S Li, X Sun… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …

Video-mined task graphs for keystep recognition in instructional videos

K Ashutosh, SK Ramakrishnan… - Advances in Neural …, 2024 - proceedings.neurips.cc
Procedural activity understanding requires perceiving human actions in terms of a broader
task, where multiple keysteps are performed in sequence across a long video to reach a …

Learning object state changes in videos: An open-world perspective

Z Xue, K Ashutosh, K Grauman - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Abstract Object State Changes (OSCs) are pivotal for video understanding. While humans
can effortlessly generalize OSC understanding from familiar to unknown objects, current …

Genhowto: Learning to generate actions and state transformations from instructional videos

T Souček, D Damen, M Wray… - Proceedings of the …, 2024 - openaccess.thecvf.com
We address the task of generating temporally consistent and physically plausible images of
actions and object state transformations. Given an input image and a text prompt describing …

Multi-sentence Grounding for Long-Term Instructional Video

Z Li, Q Chen, T Han, Y Zhang, Y Wang… - European Conference on …, 2024 - Springer
In this paper, we aim to establish an automatic, scalable pipeline for denoising the large-
scale instructional dataset and construct a high-quality video-text dataset with multiple …

Visual-semantic Alignment Temporal Parsing for Action Quality Assessment

K Gedamu, Y Ji, Y Yang, J Shao… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Action Quality Assessment (AQA) is a challenging task involving analyzing fine-grained
technical subactions, aligning high-level visual-semantic representations, and exploring …

Steps: Self-supervised key step extraction and localization from unlabeled procedural videos

A Shah, B Lundell, H Sawhney… - Proceedings of the …, 2023 - openaccess.thecvf.com
We address the problem of extracting key steps from unlabeled procedural videos,
motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training …

Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Y Chen, K Li, W Bao, D Patel, Y Kong, MR Min… - … on Computer Vision, 2024 - Springer
Learning to localize temporal boundaries of procedure steps in instructional videos is
challenging due to the limited availability of annotated large-scale training videos. Recent …

Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos

KRY Nagasinghe, H Zhou… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this paper, we explore the capability of an agent to construct a logical sequence of action
steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from …