RLIPv2: Fast scaling of relational language-image pre-training

H Yuan, S Zhang, X Wang, S Albanie… - Proceedings of the …, 2023 - openaccess.thecvf.com
Relational Language-Image Pre-training (RLIP) aims to align vision representations
with relational texts, thereby advancing the capability of relational reasoning in computer …

Open-world human-object interaction detection via multi-modal prompts

J Yang, B Li, A Zeng, L Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this paper, we develop MP-HOI, a powerful Multi-modal Prompt-based HOI detector
designed to leverage both textual descriptions for open-set generalization and visual …

Scene-Graph ViT: End-to-end open-vocabulary visual relationship detection

T Salzmann, M Ryll, A Bewley, M Minderer - European Conference on …, 2024 - Springer
Visual relationship detection aims to identify objects and their relationships in images. Prior
methods approach this task by adding separate relationship modules or decoders to existing …

Towards Flexible Visual Relationship Segmentation

F Zhu, J Yang, H Jiang - Advances in Neural Information …, 2025 - proceedings.neurips.cc
Visual relationship understanding has been studied separately in human-object interaction
(HOI) detection, scene graph generation (SGG), and referring relationships (RR) tasks …

From easy to hard: Learning curricular shape-aware features for robust panoptic scene graph generation

H Shi, L Li, J Xiao, Y Zhuang, L Chen - International Journal of Computer …, 2024 - Springer
Panoptic Scene Graph Generation (PSG) aims to generate a comprehensive graph-
structure representation based on panoptic segmentation masks. Despite remarkable …

Toward open-set human object interaction detection

M Wu, Y Liu, J Ji, X Sun, R Ji - Proceedings of the AAAI Conference on …, 2024 - ojs.aaai.org
This work is oriented toward the task of open-set Human Object Interaction (HOI) detection.
The challenge lies in identifying completely new, out-of-domain relationships, as opposed to …

RelationLMM: Large Multimodal Model as Open and Versatile Visual Relationship Generalist

C Xie, S Liang, J Li, Z Zhang, F Zhu… - IEEE Transactions on …, 2025 - ieeexplore.ieee.org
Visual relationships are crucial for visual perception and reasoning, and cover tasks like
Scene Graph Generation, Human-Object Interaction, and object affordance. Despite …

Beyond Embeddings: The Promise of Visual Table in Visual Reasoning

Y Zhong, ZY Hu, MR Lyu, L Wang - arXiv preprint arXiv:2403.18252, 2024 - arxiv.org
Visual representation learning has been a cornerstone in computer vision, involving typical
forms such as visual embeddings, structural symbols, and text-based representations …

Hydra-SGG: Hybrid relation assignment for one-stage scene graph generation

M Chen, G Chen, W Wang, Y Yang - arXiv preprint arXiv:2409.10262, 2024 - arxiv.org
DETR introduces a simplified one-stage framework for scene graph generation (SGG).
However, DETR-based SGG models face two challenges: i) Sparse supervision, as each …

Adaptive multimodal prompt for human-object interaction with local feature enhanced transformer

K Xue, Y Gao, Z Fang, X Jiang, W Yu, M Chen, C Wu - Applied Intelligence, 2024 - Springer
Human-object interaction (HOI) detection is an important computer vision task for
recognizing the interaction between humans and surrounding objects in an image or video …