HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data

Q Yu, J Li, L Wei, L Pang, W Ye, B Qin… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Multi-modal Large Language Models (MLLMs) tuned on machine-generated
instruction-following data have demonstrated remarkable performance in various multimodal …

Evaluating and analyzing relationship hallucinations in large vision-language models

M Wu, J Ji, O Huang, J Li, Y Wu, X Sun, R Ji - arXiv preprint arXiv …, 2024 - arxiv.org
The issue of hallucinations is a prevalent concern in existing Large Vision-Language
Models (LVLMs). Previous efforts have primarily focused on investigating object …

Cultural and linguistic diversity improves visual representations

A Ye, S Santy, JD Hwang, AX Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Computer vision often treats perception as objective, and this assumption gets reflected in
the way that datasets are collected and models are trained. For instance, image descriptions …

Generative Region-Language Pretraining for Open-Ended Object Detection

C Lin, Y Jiang, L Qu, Z Yuan… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
In recent research, significant attention has been devoted to the open-vocabulary object
detection task, aiming to generalize beyond the limited number of classes labeled during …

MeaCap: Memory-Augmented Zero-shot Image Captioning

Z Zeng, Y Xie, H Zhang, C Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Zero-shot image captioning (IC) without well-paired image-text data can be categorized into
two main types: training-free and text-only-training methods. While both types integrate pre …

A Survey of Hallucination in Large Visual Language Models

W Lan, W Chen, Q Chen, S Pan, H Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Visual Language Models (LVLMs) enhance user interaction and enrich the user
experience by integrating the visual modality on the basis of Large Language Models …

Tag-grounded Visual Instruction Tuning with Retrieval Augmentation

D Qi, H Zhao, Z Wei, S Li - … of the 2024 Conference on Empirical …, 2024 - aclanthology.org
Despite recent advances in the general visual instruction-following ability of Multimodal
Large Language Models (MLLMs), they still struggle with critical problems when required to …

HICEScore: A Hierarchical Metric for Image Captioning Evaluation

Z Zeng, J Sun, H Zhang, T Wen, Y Su, Y Xie… - Proceedings of the …, 2024 - dl.acm.org
Image captioning evaluation metrics can be divided into two categories, reference-based
metrics and reference-free metrics. However, reference-based approaches may struggle to …

Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning

F Lu, W Wu, K Zheng, S Ma, B Gong, J Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Generating detailed captions comprehending text-rich visual content in images has received
growing attention for Large Vision-Language Models (LVLMs). However, few studies have …

Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags

D Qi, H Zhao, Z Wei, S Li - arXiv preprint arXiv:2406.10839, 2024 - arxiv.org
Despite recent advances in the general visual instruction-following ability of Multimodal
Large Language Models (MLLMs), they still struggle with critical problems when required to …