Mitigating object hallucinations in large vision-language models through visual contrastive decoding

S Leng, H Zhang, G Chen, X Li, S Lu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Large Vision-Language Models (LVLMs) have advanced considerably, intertwining
visual recognition and language understanding to generate content that is not only coherent …

Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning

Z Li, X Wang, E Stengel-Eskin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Abstract Visual Question Answering (VQA) models often perform poorly on out-of-distribution
data and struggle with domain generalization. Due to the multi-modal nature of this task …

Prompt-guided zero-shot anomaly action recognition using pretrained deep skeleton features

F Sato, R Hachiuma, T Sekii - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
This study investigates unsupervised anomaly action recognition, which identifies video-
level abnormal-human-behavior events in an unsupervised manner without abnormal …

Negative object presence evaluation (nope) to measure object hallucination in vision-language models

H Lovenia, W Dai, S Cahyawijaya, Z Ji… - arXiv preprint arXiv …, 2023 - arxiv.org
Object hallucination poses a significant challenge in vision-language (VL) models, often
leading to the generation of nonsensical or unfaithful responses with non-existent objects …

Simvqa: Exploring simulated environments for visual question answering

P Cascante-Bonilla, H Wu, L Wang… - Proceedings of the …, 2022 - openaccess.thecvf.com
Existing work on VQA explores data augmentation to achieve better generalization by
perturbing the images in the dataset or modifying the existing questions and answers. While …

[PDF] Survey on sociodemographic bias in natural language processing

V Gupta, PN Venkit, S Wilson… - arXiv preprint arXiv …, 2023 - researchgate.net
Deep neural networks often learn unintended bias during training, which might have harmful
effects when deployed in real-world settings. This work surveys 214 papers related to …

3d-aware visual question answering about parts, poses and occlusions

X Wang, W Ma, Z Li, A Kortylewski… - Advances in Neural …, 2024 - proceedings.neurips.cc
Despite rapid progress in Visual Question Answering (VQA), existing datasets and
models mainly focus on testing reasoning in 2D. However, it is important that VQA models …

Masked images are counterfactual samples for robust fine-tuning

Y Xiao, Z Tang, P Wei, C Liu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Deep learning models are challenged by the distribution shift between the training data and
test data. Recently, the large models pre-trained on diverse data have demonstrated …

ReactioNet: Learning High-order Facial Behavior from Universal Stimulus-Reaction by Dyadic Relation Reasoning

X Li, T Wang, G Zhao, X Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Diverse visual stimuli can evoke various human affective states, which are usually
manifested in an individual's muscular actions and facial expressions. In lab-controlled …

Integrating language guidance into image-text matching for correcting false negatives

Z Li, C Guo, Z Feng, JN Hwang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Image-Text Matching (ITM) aims to establish the correspondence between images and
sentences. ITM is fundamental to various vision and language understanding tasks …