- Academic Search

Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos

Y Yang, M Ren - arXiv preprint arXiv:2501.12254, 2025 - arxiv.org

Self-supervised learning holds the promise to learn good representations from real-world
continuous uncurated data streams. However, most existing works in visual self-supervised …

Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

M Wang, Y Wang, TT Vu, E Shareghi… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent advancements in multimodal large language models (MLLMs) have made significant
progress in integrating information across various modalities, yet real-world applications in …

Cited by 1

KISA: A Unified Keyframe Identifier and Skill Annotator for Long-Horizon Robotics Demonstrations

L Kou, F Ni, Y Zheng, J Liu, Y Yuan, Z Dong… - Forty-first International … - openreview.net

Robotic manipulation tasks often span over long horizons and encapsulate multiple
subtasks with different skills. Learning policies directly from long-horizon demonstrations is …

Cited by 3

TinyMem: Condensing Multimodal Memory for Long-form Video Action Detection

R Tian, Q Dai, H Hu, Z Wu - openreview.net

Despite the great advances in video understanding with deep neural networks, current
solutions still struggle with input videos that last for minutes, if not hours. To mitigate this …

Revisiting kernel temporal segmentation as an adaptive tokenizer for long-form video understanding
