Grounded SAM: Assembling open-world models for diverse visual tasks
We introduce Grounded SAM, which uses Grounding DINO as an open-set object detector to
combine with the segment anything model (SAM). This integration enables the detection and …
SkeletonMAE: Graph-based masked autoencoder for skeleton sequence pre-training
Skeleton sequence representation learning has shown great advantages for action
recognition due to its promising ability to model human joints and topology. However, the …
Deep learning technique for human parsing: A survey and outlook
Human parsing aims to partition humans in image or video into multiple pixel-level semantic
parts. In the last decade, it has gained significantly increased interest in the computer vision …
HumanMAC: Masked motion completion for human motion prediction
Human motion prediction is a classical problem in computer vision and computer graphics,
which has a wide range of practical applications. Previous effects achieve great empirical …
F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
Existing 3D human object interaction (HOI) datasets and models simply align global
descriptions with the long HOI sequence, while lacking a detailed understanding of …
Unified Human-centric Model, Framework and Benchmark: A Survey
Human-centric Computer Vision Tasks (HCTs) refer to a series of tasks related to the human
body, such as Human Pose Estimation, Pedestrian Tracking, Re-Identification (ReID) …
X-Pose: Detecting any keypoints
This work aims to address an advanced keypoint detection problem: how to accurately
detect any keypoints in complex real-world scenarios, which involves massive, messy, and …
OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing
Multi-object multi-part scene segmentation is a challenging task whose complexity scales
exponentially with part granularity and number of scene objects. To address the task, we …
From Simple to Complex Scenes: Learning Robust Feature Representations for Accurate Human Parsing
Human parsing has attracted considerable research interest due to its broad potential
applications in the computer vision community. In this paper, we explore several useful …
KptLLM: Unveiling the Power of Large Language Model for Keypoint Comprehension
Recent advancements in Multimodal Large Language Models (MLLMs) have greatly
improved their abilities in image understanding. However, these models often struggle with …