RLIPv2: Fast scaling of relational language-image pre-training
Relational Language-Image Pre-training (RLIP) aims to align vision representations
with relational texts, thereby advancing the capability of relational reasoning in computer …
Open-world human-object interaction detection via multi-modal prompts
In this paper, we develop MP-HOI, a powerful Multi-modal Prompt-based HOI detector
designed to leverage both textual descriptions for open-set generalization and visual …
Scene-Graph ViT: End-to-end open-vocabulary visual relationship detection
Visual relationship detection aims to identify objects and their relationships in images. Prior
methods approach this task by adding separate relationship modules or decoders to existing …
Towards Flexible Visual Relationship Segmentation
Visual relationship understanding has been studied separately in human-object interaction
(HOI) detection, scene graph generation (SGG), and referring relationships (RR) tasks …
From easy to hard: Learning curricular shape-aware features for robust panoptic scene graph generation
Panoptic Scene Graph Generation (PSG) aims to generate a comprehensive graph-structure
representation based on panoptic segmentation masks. Despite remarkable …
Toward open-set human object interaction detection
This work is oriented toward the task of open-set Human Object Interaction (HOI) detection.
The challenge lies in identifying completely new, out-of-domain relationships, as opposed to …
RelationLMM: Large Multimodal Model as Open and Versatile Visual Relationship Generalist
Visual relationships are crucial for visual perception and reasoning, and cover tasks like
Scene Graph Generation, Human-Object Interaction, and object affordance. Despite …
Beyond Embeddings: The Promise of Visual Table in Visual Reasoning
Visual representation learning has been a cornerstone in computer vision, involving typical
forms such as visual embeddings, structural symbols, and text-based representations …
Hydra-SGG: Hybrid relation assignment for one-stage scene graph generation
DETR introduces a simplified one-stage framework for scene graph generation (SGG).
However, DETR-based SGG models face two challenges: i) Sparse supervision, as each …
Adaptive multimodal prompt for human-object interaction with local feature enhanced transformer
Human-object interaction (HOI) detection is an important computer vision task for
recognizing the interaction between humans and surrounding objects in an image or video …