MDETR - modulated detection for end-to-end multi-modal understanding

A Kamath, M Singh, Y LeCun… - Proceedings of the …, 2021 - openaccess.thecvf.com
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of
interest from the image. However, this crucial module is typically used as a black box …

Debiased visual question answering from feature and sample perspectives

Z Wen, G Xu, M Tan, Q Wu… - Advances in Neural …, 2021 - proceedings.neurips.cc
Visual question answering (VQA) is designed to examine the visual-textual reasoning ability
of an intelligent agent. However, recent observations show that many VQA models may only …

Test-time model adaptation for visual question answering with debiased self-supervisions

Z Wen, S Niu, G Li, Q Wu, M Tan… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Visual question answering (VQA) is a prevalent task in real-world, and plays an essential
role in helping the blind understand the physical world. However, due to the real-world …

Context disentangling and prototype inheriting for robust visual grounding

W Tang, L Li, X Liu, L Jin, J Tang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Visual grounding (VG) aims to locate a specific target in an image based on a given
language query. The discriminative information from context is important for distinguishing …

Transformer-based relational inference network for complex visual relational reasoning

M Tan, Z Wen, L Fang, Q Wu - ACM Transactions on Multimedia …, 2023 - dl.acm.org
Visual Relational Reasoning is the basis of many vision-and-language based tasks (e.g.,
visual question answering and referring expression comprehension). In this article, we …

Deep scene understanding with extended text description for human object interaction detection

HS Hong, JC Lee, A Kumar, S Ahn, DG Lee - Expert Systems with …, 2025 - Elsevier
Human–object interaction (HOI) detection plays a pivotal role in scene understanding,
enabling the identification, localization, and behavioral intention prediction of humans and …

Deep Scene Understanding with Extended Text Description for Human

DG Lee - Available at SSRN 4705624 - papers.ssrn.com
Human-object interaction (HOI) detection plays a pivotal role in scene understanding,
enabling the identification, localization, and behavioral intention prediction of humans and …