VGDiffZero: Text-to-image diffusion models can be zero-shot visual grounders

X Liu, S Huang, Y Kang, H Chen… - ICASSP, 2024 - ieeexplore.ieee.org
Large-scale text-to-image diffusion models have shown impressive capabilities for
generative tasks by leveraging strong vision-language alignment from pre-training …

Unified Text-to-Image Generation and Retrieval

L Qu, H Li, T Wang, W Wang, Y Li, L Nie… - arXiv preprint arXiv …, 2024 - arxiv.org
How humans can efficiently and effectively acquire images has been a perennial
question. A typical solution is text-to-image retrieval from an existing database given the text …

Interpreting and leveraging semantic information in diffusion models

D Kim, X Thomas, D Ghadiyaram - arXiv preprint arXiv:2411.16725, 2024 - arxiv.org
We study how rich visual semantic information is represented within various
layers and denoising timesteps of different diffusion architectures. We uncover …

FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

X He, J Zheng, JZ Fang, R Piramuthu, M Bansal… - arXiv preprint arXiv …, 2024 - arxiv.org
Controllable text-to-image (T2I) diffusion models generate images conditioned on both text
prompts and semantic inputs of other modalities like edge maps. Nevertheless, current …

GenzIQA: Generalized Image Quality Assessment using Prompt-Guided Latent Diffusion Models

D De, S Mitra, R Soundararajan - arXiv preprint arXiv:2406.04654, 2024 - arxiv.org
The design of no-reference (NR) image quality assessment (IQA) algorithms is extremely
important to benchmark and calibrate user experiences in modern visual systems. A major …

Multimodal Understanding using Stable-Diffusion as a Task Aware Feature Extractor

V Agarwal, G Kohavi, M Gwilliam, E Verma, D Ulbricht… - vatsalag99.github.io
Multimodal large language models have shown tremendous advancements in parsing and
reasoning about complex scenes. However, recent research has highlighted the weak vision …