Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition

Y Oh, P Ahn, J Kim, G Song, S Lee, IS Kweon… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision and language models (VLMs) such as CLIP have showcased remarkable zero-shot
recognition abilities yet face challenges in visio-linguistic compositionality, particularly in …

Interpretable Composition Attribution Enhancement for Visio-linguistic Compositional Understanding

W Li, Z Huang, X Tian, L Lu, H Li… - Proceedings of the …, 2024 - aclanthology.org
Contrastively trained vision-language models such as CLIP have achieved remarkable
progress in vision and language representation learning. Despite the promising progress …

Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective

X Zhu, P Sun, Y Song, Y Xiao, Z Li, C Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Accurate interpretation and visualization of human instructions are crucial for text-to-image
(T2I) synthesis. However, current models struggle to capture semantic variations from word …