MDETR - Modulated detection for end-to-end multi-modal understanding
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of
interest from the image. However, this crucial module is typically used as a black box …
Debiased visual question answering from feature and sample perspectives
Visual question answering (VQA) is designed to examine the visual-textual reasoning ability
of an intelligent agent. However, recent observations show that many VQA models may only …
Test-time model adaptation for visual question answering with debiased self-supervisions
Visual question answering (VQA) is a prevalent task in the real world, and plays an essential
role in helping the blind understand the physical world. However, due to the real-world …
Context disentangling and prototype inheriting for robust visual grounding
Visual grounding (VG) aims to locate a specific target in an image based on a given
language query. The discriminative information from context is important for distinguishing …
Transformer-based relational inference network for complex visual relational reasoning
Visual Relational Reasoning is the basis of many vision-and-language based tasks (e.g.,
visual question answering and referring expression comprehension). In this article, we …
Deep scene understanding with extended text description for human object interaction detection
Human–object interaction (HOI) detection plays a pivotal role in scene understanding,
enabling the identification, localization, and behavioral intention prediction of humans and …
Deep Scene Understanding with Extended Text Description for Human
DG Lee - Available at SSRN 4705624 - papers.ssrn.com
Human-object interaction (HOI) detection plays a pivotal role in scene understanding,
enabling the identification, localization, and behavioral intention prediction of humans and …