LayoutLLM-T2I: Eliciting layout guidance from LLM for text-to-image generation
In the text-to-image generation field, recent remarkable progress in Stable Diffusion makes it
possible to generate rich kinds of novel photorealistic images. However, current models still …
X-Mesh: Towards fast and accurate text-driven 3D stylization via dynamic textual guidance
Text-driven 3D stylization is a complex and crucial task in the fields of computer vision (CV)
and computer graphics (CG), aimed at transforming a bare mesh to fit a target text. Prior …
Constructing holistic spatio-temporal scene graph for video semantic role labeling
As one of the core video semantic understanding tasks, Video Semantic Role Labeling
(VidSRL) aims to detect the salient events from given videos, by recognizing the predict …
Rotated multi-scale interaction network for referring remote sensing image segmentation
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that
combines computer vision and natural language processing. Traditional Referring Image …
Beyond first impressions: Integrating joint multi-modal cues for comprehensive 3D representation
In recent years, 3D representation learning has turned to 2D vision-language pre-trained
models to overcome data scarcity challenges. However, existing methods simply transfer 2D …
Semi-supervised panoptic narrative grounding
Despite considerable progress, the advancement of Panoptic Narrative Grounding (PNG)
remains hindered by costly annotations. In this paper, we introduce a novel Semi …
Beat: Bi-directional one-to-many embedding alignment for text-based person retrieval
Text-based person retrieval (TPR) is a challenging task that involves retrieving a specific
individual based on a textual description. Despite considerable efforts to bridge the gap …
Piglet: Pixel-level grounding of language expressions with transformers
This paper proposes Panoptic Narrative Grounding, a spatially fine and general formulation
of the natural language visual grounding problem. We establish an experimental framework …
Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation
Recent advancements in single-stage Panoptic Narrative Grounding (PNG) have
demonstrated significant potential. These methods predict pixel-level masks by directly …
Dynamic prompting of frozen text-to-image diffusion models for panoptic narrative grounding
Panoptic narrative grounding (PNG), whose core target is fine-grained image-text alignment,
requires a panoptic segmentation of referred objects given a narrative caption. Previous …