Semi-supervised panoptic narrative grounding
Despite considerable progress, the advancement of Panoptic Narrative Grounding (PNG)
remains hindered by costly annotations. In this paper, we introduce a novel Semi …
PPMN: Pixel-phrase matching network for one-stage panoptic narrative grounding
Panoptic Narrative Grounding (PNG) is an emerging task whose goal is to segment visual
objects of things and stuff categories described by dense narrative captions of a still image …
Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation
Recent advancements in single-stage Panoptic Narrative Grounding (PNG) have
demonstrated significant potential. These methods predict pixel-level masks by directly …
Integrating Neural-Symbolic Reasoning With Variational Causal Inference Network for Explanatory Visual Question Answering
Recently, a novel multimodal reasoning task named Explanatory Visual Question Answering
(EVQA) has been introduced, which combines answering visual questions with multimodal …
HumanFormer: Human-centric Prompting Multi-modal Perception Transformer for Referring Crowd Detection
As an important step towards crowd understanding, referring crowd detection (RCD) aims to
locate the person in human crowded environments described by a natural language …
Graph-based referring expression comprehension with expression-guided selective filtering and noun-oriented reasoning
The objective of referring expression comprehension (REC) is to find the common feature
domain between language expressions and visual objects. Due to the complex nature of …
A survey on interpretable cross-modal reasoning
Authors' addresses: Dizhan Xue, xuedizhan17@mails.ucas.ac.cn; Shengsheng Qian,
shengsheng.qian@nlpr.ia.ac.cn; Zuyi Zhou, zhouzuyi2023@ia.ac.cn, MAIS, Institute of …
Universal Relocalizer for Weakly Supervised Referring Expression Grounding
This article introduces the Universal Relocalizer, a novel approach designed for weakly
supervised referring expression grounding. Our method strives to pinpoint a target proposal …
RefCrowd: Grounding the target in crowd with referring expressions
Crowd understanding has aroused widespread interest in the vision domain due to its
important practical significance. Unfortunately, there is no effort to explore crowd …
Linking people across text and images based on social relation reasoning
As a sub-task of visual grounding, linking people across text and images aims to localize
target people in images with corresponding sentences. Existing approaches tend to capture …