TimeChat: A time-sensitive multimodal large language model for long video understanding

S Ren, L Yao, S Li, X Sun… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …

Video-mined task graphs for keystep recognition in instructional videos

K Ashutosh, SK Ramakrishnan… - Advances in Neural …, 2024 - proceedings.neurips.cc
Procedural activity understanding requires perceiving human actions in terms of a broader
task, where multiple keysteps are performed in sequence across a long video to reach a …

Learning object state changes in videos: An open-world perspective

Z Xue, K Ashutosh, K Grauman - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Abstract Object State Changes (OSCs) are pivotal for video understanding. While humans
can effortlessly generalize OSC understanding from familiar to unknown objects, current …

Genhowto: Learning to generate actions and state transformations from instructional videos

T Souček, D Damen, M Wray… - Proceedings of the …, 2024 - openaccess.thecvf.com
We address the task of generating temporally consistent and physically plausible images of
actions and object state transformations. Given an input image and a text prompt describing …

Multi-sentence Grounding for Long-Term Instructional Video

Z Li, Q Chen, T Han, Y Zhang, Y Wang… - European Conference on …, 2024 - Springer
In this paper, we aim to establish an automatic, scalable pipeline for denoising the large-
scale instructional dataset and construct a high-quality video-text dataset with multiple …

Visual-semantic Alignment Temporal Parsing for Action Quality Assessment

K Gedamu, Y Ji, Y Yang, J Shao… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Action Quality Assessment (AQA) is a challenging task involving analyzing fine-grained
technical subactions, aligning high-level visual-semantic representations, and exploring …

Steps: Self-supervised key step extraction and localization from unlabeled procedural videos

A Shah, B Lundell, H Sawhney… - Proceedings of the …, 2023 - openaccess.thecvf.com
We address the problem of extracting key steps from unlabeled procedural videos,
motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training …

Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Y Chen, K Li, W Bao, D Patel, Y Kong, MR Min… - … on Computer Vision, 2024 - Springer
Learning to localize temporal boundaries of procedure steps in instructional videos is
challenging due to the limited availability of annotated large-scale training videos. Recent …

Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos

KRY Nagasinghe, H Zhou… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this paper, we explore the capability of an agent to construct a logical sequence of action
steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from …