TimeChat: A time-sensitive multimodal large language model for long video understanding
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …
Video-mined task graphs for keystep recognition in instructional videos
Procedural activity understanding requires perceiving human actions in terms of a broader
task, where multiple keysteps are performed in sequence across a long video to reach a …
Learning object state changes in videos: An open-world perspective
Abstract: Object State Changes (OSCs) are pivotal for video understanding. While humans
can effortlessly generalize OSC understanding from familiar to unknown objects, current …
GenHowTo: Learning to generate actions and state transformations from instructional videos
We address the task of generating temporally consistent and physically plausible images of
actions and object state transformations. Given an input image and a text prompt describing …
Multi-sentence Grounding for Long-Term Instructional Video
In this paper, we aim to establish an automatic, scalable pipeline for denoising the large-scale
instructional dataset and construct a high-quality video-text dataset with multiple …
Visual-semantic Alignment Temporal Parsing for Action Quality Assessment
Action Quality Assessment (AQA) is a challenging task involving analyzing fine-grained
technical subactions, aligning high-level visual-semantic representations, and exploring …
STEPs: Self-supervised key step extraction and localization from unlabeled procedural videos
A Shah, B Lundell, H Sawhney… - Proceedings of the …, 2023 - openaccess.thecvf.com
We address the problem of extracting key steps from unlabeled procedural videos,
motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training …
Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment
Learning to localize temporal boundaries of procedure steps in instructional videos is
challenging due to the limited availability of annotated large-scale training videos. Recent …
Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos
In this paper, we explore the capability of an agent to construct a logical sequence of action
steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from …