Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives

K Grauman, A Westbury, L Torresani… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …

VideoLLM-online: Online video large language model for streaming video

J Chen, Z Lv, S Wu, KQ Lin, C Song… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Language Models (LLMs) have been enhanced with vision capabilities,
enabling them to comprehend images, videos, and interleaved vision-language content …

Procedure-aware surgical video-language pretraining with hierarchical knowledge augmentation

K Yuan, N Navab, N Padoy - Advances in Neural …, 2025 - proceedings.neurips.cc
Surgical video-language pretraining (VLP) faces unique challenges due to the knowledge
domain gap and the scarcity of multi-modal data. This study aims to bridge the gap by …

Learning fine-grained view-invariant representations from unpaired ego-exo videos via temporal alignment

ZS Xue, K Grauman - Advances in Neural Information …, 2023 - proceedings.neurips.cc
The egocentric and exocentric viewpoints of a human activity look dramatically different, yet
invariant representations to link them are essential for many potential applications in …

Video-mined task graphs for keystep recognition in instructional videos

K Ashutosh, SK Ramakrishnan… - Advances in Neural …, 2023 - proceedings.neurips.cc
Procedural activity understanding requires perceiving human actions in terms of a broader
task, where multiple keysteps are performed in sequence across a long video to reach a …

Progress-aware online action segmentation for egocentric procedural task videos

Y Shen, E Elhamifar - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
We address the problem of online action segmentation for egocentric procedural task
videos. While previous studies have mostly focused on offline action segmentation, where …

HT-Step: Aligning instructional articles with how-to videos

T Afouras, E Mavroudi, T Nagarajan… - Advances in …, 2023 - proceedings.neurips.cc
We introduce HT-Step, a large-scale dataset containing temporal annotations of instructional
article steps in cooking videos. It includes 122k segment-level annotations over 20k narrated …

VideoLLM-MoD: Efficient video-language streaming with mixture-of-depths vision computation

S Wu, J Chen, KQ Lin, Q Wang, Y Gao… - Advances in …, 2025 - proceedings.neurips.cc
A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while
increasing the number of vision tokens generally enhances visual understanding, it also …

Learning to ground instructional articles in videos through narrations

E Mavroudi, T Afouras… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
In this paper, we present an approach for localizing steps of procedural activities in narrated
how-to videos. To deal with the scarcity of labeled data at scale, we source the step …

PrISM-Q&A: Step-aware voice assistant on a smartwatch enabled by multimodal procedure tracking and large language models

R Arakawa, JF Lehman, M Goel - Proceedings of the ACM on Interactive …, 2024 - dl.acm.org
Voice assistants capable of answering user queries during various physical tasks have
shown promise in guiding users through complex procedures. However, users often find it …