HalluciDoctor: Mitigating hallucinatory toxicity in visual instruction data
Multi-modal Large Language Models (MLLMs) tuned on machine-generated
instruction-following data have demonstrated remarkable performance in various multimodal …
Evaluating and analyzing relationship hallucinations in large vision-language models
The issue of hallucinations is a prevalent concern in existing Large Vision-Language
Models (LVLMs). Previous efforts have primarily focused on investigating object …
Cultural and linguistic diversity improves visual representations
Computer vision often treats perception as objective, and this assumption gets reflected in
the way that datasets are collected and models are trained. For instance, image descriptions …
Generative Region-Language Pretraining for Open-Ended Object Detection
In recent research, significant attention has been devoted to the open-vocabulary object
detection task, aiming to generalize beyond the limited number of classes labeled during …
MeaCap: Memory-Augmented Zero-shot Image Captioning
Zero-shot image captioning (IC) without well-paired image-text data can be categorized into
two main types: training-free and text-only-training methods. While both types integrate pre …
A Survey of Hallucination in Large Visual Language Models
Large Visual Language Models (LVLMs) enhance user interaction and enrich the user
experience by integrating the visual modality on the basis of Large Language Models …
Tag-grounded Visual Instruction Tuning with Retrieval Augmentation
Despite recent advances in the general visual instruction-following ability of Multimodal
Large Language Models (MLLMs), they still struggle with critical problems when required to …
HICEScore: A Hierarchical Metric for Image Captioning Evaluation
Image captioning evaluation metrics can be divided into two categories: reference-based
metrics and reference-free metrics. However, reference-based approaches may struggle to …
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning
Generating detailed captions comprehending text-rich visual content in images has received
growing attention for Large Vision-Language Models (LVLMs). However, few studies have …
growing attention for Large Vision-Language Models (LVLMs). However, few studies have …
Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags
Despite recent advances in the general visual instruction-following ability of Multimodal
Large Language Models (MLLMs), they still struggle with critical problems when required to …