SugarCrepe: Fixing hackable benchmarks for vision-language compositionality

CY Hsieh, J Zhang, Z Ma… - Advances in neural …, 2023 - proceedings.neurips.cc
In the last year alone, a surge of new benchmarks to measure compositional
understanding of vision-language models has permeated the machine learning ecosystem …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Compositional chain-of-thought prompting for large multimodal models

C Mitra, B Huang, T Darrell… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …

MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

K Ying, F Meng, J Wang, Z Li, H Lin, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) show significant strides in general-purpose
multimodal applications such as visual dialogue and embodied navigation. However …

What's "up" with vision-language models? Investigating their struggle with spatial reasoning

A Kamath, J Hessel, KW Chang - arXiv preprint arXiv:2310.19785, 2023 - arxiv.org
Recent vision-language (VL) models are powerful, but can they reliably distinguish "right"
from "left"? We curate three new corpora to quantify model comprehension of such basic …

Teaching structured vision & language concepts to vision & language models

S Doveh, A Arbelle, S Harary… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in
a variety of tasks. However, some aspects of complex language understanding still remain a …

Iterated learning improves compositionality in large vision-language models

C Zheng, J Zhang, A Kembhavi… - Proceedings of the …, 2024 - openaccess.thecvf.com
A fundamental characteristic common to both human vision and natural language is their
compositional nature. Yet despite the performance gains contributed by large vision and …

Dense and aligned captions (DAC) promote compositional reasoning in VL models

S Doveh, A Arbelle, S Harary… - Advances in …, 2023 - proceedings.neurips.cc
Vision and Language (VL) models offer an effective method for aligning the representation
spaces of images and text, allowing for numerous applications such as cross-modal retrieval …

Going beyond nouns with vision & language models using synthetic data

P Cascante-Bonilla, K Shehada… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale pre-trained Vision & Language (VL) models have shown remarkable
performance in many applications, enabling replacing a fixed set of supported classes with …

Match, expand and improve: Unsupervised finetuning for zero-shot action recognition with language knowledge

W Lin, L Karlinsky, N Shvetsova… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale Vision-Language (VL) models have shown tremendous success in
aligning representations between visual and text modalities. This enables remarkable …