Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos
Self-supervised learning holds the promise to learn good representations from real-world
continuous uncurated data streams. However, most existing works in visual self-supervised …
continuous uncurated data streams. However, most existing works in visual self-supervised …
Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR
Recent advancements in multimodal large language models (MLLMs) have made significant
progress in integrating information across various modalities, yet real-world applications in …
progress in integrating information across various modalities, yet real-world applications in …
KISA: A Unified Keyframe Identifier and Skill Annotator for Long-Horizon Robotics Demonstrations
Robotic manipulation tasks often span over long horizons and encapsulate multiple
subtasks with different skills. Learning policies directly from long-horizon demonstrations is …
subtasks with different skills. Learning policies directly from long-horizon demonstrations is …
TinyMem: Condensing Multimodal Memory for Long-form Video Action Detection
Despite the great advances in video understanding with deep neural networks, current
solutions still struggle with input videos that last for minutes, if not hours. To mitigate this …
solutions still struggle with input videos that last for minutes, if not hours. To mitigate this …