A comprehensive survey of deep learning for image captioning
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …
Adversarial text-to-image synthesis: A review
With the advent of generative adversarial networks, synthesizing images from text
descriptions has recently become an active research area. It is a flexible and intuitive way for …
DiffusionDet: Diffusion model for object detection
We propose DiffusionDet, a new framework that formulates object detection as a denoising
diffusion process from noisy boxes to object boxes. During the training stage, object boxes …
Beyond transmitting bits: Context, semantics, and task-oriented communications
Communication systems to date primarily aim at reliably communicating bit sequences.
Such an approach provides efficient engineering designs that are agnostic to the meanings …
Transformer-based visual segmentation: A survey
Visual segmentation seeks to partition images, video frames, or point clouds into multiple
segments or groups. This technique has numerous real-world applications, such as …
Semantic communications: Principles and challenges
Semantic communication, regarded as the breakthrough beyond the Shannon paradigm,
aims at the successful transmission of semantic information conveyed by the source rather …
CPT: Colorful prompt tuning for pre-trained vision-language models
Vision-Language Pre-training (VLP) models have shown promising capabilities in
grounding natural language in image data, facilitating a broad range of cross-modal tasks …
Compositional chain-of-thought prompting for large multimodal models
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …
Imagine that! abstract-to-intricate text-to-image synthesis with scene graph hallucination diffusion
In this work, we investigate the task of text-to-image (T2I) synthesis under the
abstract-to-intricate setting, i.e., generating intricate visual content from simple abstract text prompts …
Unbiased scene graph generation from biased training
Today's scene graph generation (SGG) task is still far from practical, mainly due to the
severe training bias, e.g., collapsing diverse "human walk on/sit on/lay on beach" into "human …