Mitigating object hallucinations in large vision-language models through visual contrastive decoding
Abstract Large Vision-Language Models (LVLMs) have advanced considerably, intertwining
visual recognition and language understanding to generate content that is not only coherent …
Super-CLEVR: A virtual benchmark to diagnose domain robustness in visual reasoning
Abstract Visual Question Answering (VQA) models often perform poorly on out-of-distribution
data and struggle with domain generalization. Due to the multi-modal nature of this task …
Prompt-guided zero-shot anomaly action recognition using pretrained deep skeleton features
This study investigates unsupervised anomaly action recognition, which identifies video-
level abnormal-human-behavior events in an unsupervised manner without abnormal …
Negative object presence evaluation (NOPE) to measure object hallucination in vision-language models
Object hallucination poses a significant challenge in vision-language (VL) models, often
leading to the generation of nonsensical or unfaithful responses with non-existent objects …
SimVQA: Exploring simulated environments for visual question answering
Existing work on VQA explores data augmentation to achieve better generalization by
perturbing the images in the dataset or modifying the existing questions and answers. While …
Survey on sociodemographic bias in natural language processing
Deep neural networks often learn unintended bias during training, which might have harmful
effects when deployed in real-world settings. This work surveys 214 papers related to …
3D-aware visual question answering about parts, poses and occlusions
Despite rapid progress in Visual Question Answering (VQA), existing datasets and
models mainly focus on testing reasoning in 2D. However, it is important that VQA models …
Masked images are counterfactual samples for robust fine-tuning
Deep learning models are challenged by the distribution shift between the training data and
test data. Recently, the large models pre-trained on diverse data have demonstrated …
ReactioNet: Learning High-order Facial Behavior from Universal Stimulus-Reaction by Dyadic Relation Reasoning
Diverse visual stimuli can evoke various human affective states, which are usually
manifested in an individual's muscular actions and facial expressions. In lab-controlled …
Integrating language guidance into image-text matching for correcting false negatives
Image-Text Matching (ITM) aims to establish the correspondence between images and
sentences. ITM is fundamental to various vision and language understanding tasks …