Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition
Vision and language models (VLMs) such as CLIP have showcased remarkable zero-shot
recognition abilities yet face challenges in visio-linguistic compositionality, particularly in …
Interpretable Composition Attribution Enhancement for Visio-linguistic Compositional Understanding
Contrastively trained vision-language models such as CLIP have achieved remarkable
progress in vision and language representation learning. Despite the promising progress …
Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective
Accurate interpretation and visualization of human instructions are crucial for text-to-image
(T2I) synthesis. However, current models struggle to capture semantic variations from word …