Learning by Aligning 2D Skeleton Sequences and Multi-modality Fusion

QH Tran, M Ahmed, M Popattia, MH Ahmed… - … on Computer Vision, 2024 - Springer
This paper presents a self-supervised temporal video alignment framework which is useful
for several fine-grained human activity understanding applications. In contrast with the state …

Video LLMs for Temporal Reasoning in Long Videos

FJ Fateh, U Ahmed, H Khan, MZ Zia… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces TemporalVLM, a video large language model capable of effective
temporal reasoning and fine-grained understanding in long videos. At the core, our …

Understanding via Gaze: Gaze-based Task Decomposition for Imitation Learning of Robot Manipulation

R Takizawa, Y Ohmura, Y Kuniyoshi - arXiv preprint arXiv:2501.15071, 2025 - arxiv.org
In imitation learning for robotic manipulation, decomposing object manipulation tasks into
multiple semantic actions is essential. This decomposition enables the reuse of learned …