HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data

Q Yu, J Li, L Wei, L Pang, W Ye, B Qin… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Multi-modal Large Language Models (MLLMs) tuned on machine-generated
instruction-following data have demonstrated remarkable performance in various multimodal …

Evaluating and analyzing relationship hallucinations in large vision-language models

M Wu, J Ji, O Huang, J Li, Y Wu, X Sun, R Ji - arXiv preprint arXiv …, 2024 - arxiv.org
The issue of hallucinations is a prevalent concern in existing Large Vision-Language
Models (LVLMs). Previous efforts have primarily focused on investigating object …

Cultural and linguistic diversity improves visual representations

A Ye, S Santy, JD Hwang, AX Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Computer vision often treats perception as objective, and this assumption gets reflected in
the way that datasets are collected and models are trained. For instance, image descriptions …

Generative Region-Language Pretraining for Open-Ended Object Detection

C Lin, Y Jiang, L Qu, Z Yuan… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
In recent research, significant attention has been devoted to the open-vocabulary object
detection task, aiming to generalize beyond the limited number of classes labeled during …

MeaCap: Memory-Augmented Zero-shot Image Captioning

Z Zeng, Y Xie, H Zhang, C Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Zero-shot image captioning (IC) without well-paired image-text data can be categorized into
two main types: training-free and text-only-training methods. While both types integrate pre …

A Survey of Hallucination in Large Visual Language Models

W Lan, W Chen, Q Chen, S Pan, H Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Visual Language Models (LVLMs) enhance user interaction and enrich the user
experience by integrating the visual modality on the basis of Large Language Models …

Tag-grounded Visual Instruction Tuning with Retrieval Augmentation

D Qi, H Zhao, Z Wei, S Li - … of the 2024 Conference on Empirical …, 2024 - aclanthology.org
Despite recent advances in the general visual instruction-following ability of Multimodal
Large Language Models (MLLMs), they still struggle with critical problems when required to …

HICEScore: A Hierarchical Metric for Image Captioning Evaluation

Z Zeng, J Sun, H Zhang, T Wen, Y Su, Y Xie… - Proceedings of the …, 2024 - dl.acm.org
Image captioning evaluation metrics can be divided into two categories, reference-based
metrics and reference-free metrics. However, reference-based approaches may struggle to …

Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning

F Lu, W Wu, K Zheng, S Ma, B Gong, J Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Generating detailed captions comprehending text-rich visual content in images has received
growing attention for Large Vision-Language Models (LVLMs). However, few studies have …

Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags

D Qi, H Zhao, Z Wei, S Li - arXiv preprint arXiv:2406.10839, 2024 - arxiv.org
Despite recent advances in the general visual instruction-following ability of Multimodal
Large Language Models (MLLMs), they still struggle with critical problems when required to …