Learning by Aligning 2D Skeleton Sequences and Multi-modality Fusion
This paper presents a self-supervised temporal video alignment framework which is useful
for several fine-grained human activity understanding applications. In contrast with the state …
Video LLMs for Temporal Reasoning in Long Videos
This paper introduces TemporalVLM, a video large language model capable of effective
temporal reasoning and fine-grained understanding in long videos. At the core, our …
Understanding via Gaze: Gaze-based Task Decomposition for Imitation Learning of Robot Manipulation
R Takizawa, Y Ohmura, Y Kuniyoshi - arXiv preprint arXiv:2501.15071, 2025 - arxiv.org
In imitation learning for robotic manipulation, decomposing object manipulation tasks into
multiple semantic actions is essential. This decomposition enables the reuse of learned …