Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives

K Grauman, A Westbury, L Torresani… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …

VideoLLM-online: Online video large language model for streaming video

J Chen, Z Lv, S Wu, KQ Lin, C Song… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Language Models (LLMs) have been enhanced with vision capabilities,
enabling them to comprehend images, videos, and interleaved vision-language content …

Procedure-aware surgical video-language pretraining with hierarchical knowledge augmentation

K Yuan, N Navab, N Padoy - Advances in Neural …, 2025 - proceedings.neurips.cc
Surgical video-language pretraining (VLP) faces unique challenges due to the knowledge
domain gap and the scarcity of multi-modal data. This study aims to bridge the gap by …

Learning fine-grained view-invariant representations from unpaired ego-exo videos via temporal alignment

ZS Xue, K Grauman - Advances in Neural Information …, 2023 - proceedings.neurips.cc
The egocentric and exocentric viewpoints of a human activity look dramatically different, yet
invariant representations to link them are essential for many potential applications in …

Video-mined task graphs for keystep recognition in instructional videos

K Ashutosh, SK Ramakrishnan… - Advances in Neural …, 2023 - proceedings.neurips.cc
Procedural activity understanding requires perceiving human actions in terms of a broader
task, where multiple keysteps are performed in sequence across a long video to reach a …

Progress-aware online action segmentation for egocentric procedural task videos

Y Shen, E Elhamifar - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
We address the problem of online action segmentation for egocentric procedural task
videos. While previous studies have mostly focused on offline action segmentation, where …

HT-Step: Aligning instructional articles with how-to videos

T Afouras, E Mavroudi, T Nagarajan… - Advances in …, 2023 - proceedings.neurips.cc
We introduce HT-Step, a large-scale dataset containing temporal annotations of instructional
article steps in cooking videos. It includes 122k segment-level annotations over 20k narrated …

VideoLLM-MoD: Efficient video-language streaming with mixture-of-depths vision computation

S Wu, J Chen, KQ Lin, Q Wang, Y Gao… - Advances in …, 2025 - proceedings.neurips.cc
A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while
increasing the number of vision tokens generally enhances visual understanding, it also …

Learning to ground instructional articles in videos through narrations

E Mavroudi, T Afouras… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
In this paper, we present an approach for localizing steps of procedural activities in narrated
how-to videos. To deal with the scarcity of labeled data at scale, we source the step …

PrISM-Q&A: Step-aware voice assistant on a smartwatch enabled by multimodal procedure tracking and large language models

R Arakawa, JF Lehman, M Goel - Proceedings of the ACM on Interactive …, 2024 - dl.acm.org
Voice assistants capable of answering user queries during various physical tasks have
shown promise in guiding users through complex procedures. However, users often find it …