Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

On evaluating adversarial robustness of large vision-language models

Y Zhao, T Pang, C Du, X Yang, C Li… - Advances in …, 2023 - proceedings.neurips.cc
Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented
performance in response generation, especially with visual inputs, enabling more creative …

Negative object presence evaluation (NOPE) to measure object hallucination in vision-language models

H Lovenia, W Dai, S Cahyawijaya, Z Ji… - arXiv preprint arXiv …, 2023 - arxiv.org
Object hallucination poses a significant challenge in vision-language (VL) models, often
leading to the generation of nonsensical or unfaithful responses with non-existent objects …

NaturalBench: Evaluating vision-language models on natural adversarial samples

B Li, Z Lin, W Peng, JD Nyandwi… - Advances in …, 2025 - proceedings.neurips.cc
Vision-language models (VLMs) have made significant progress in recent visual-question-
answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However …

MTVQA: Benchmarking multilingual text-centric visual question answering

J Tang, Q Liu, Y Ye, J Lu, S Wei, C Lin, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates
human-machine interaction in text-centric visual environments but also serves as a de facto …

Video question answering: Datasets, algorithms and challenges

Y Zhong, J Xiao, W Ji, Y Li, W Deng… - arXiv preprint arXiv …, 2022 - arxiv.org
Video Question Answering (VideoQA) aims to answer natural language questions according
to the given videos. It has earned increasing attention with recent research trends in joint …

An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models

H Luo, J Gu, F Liu, P Torr - arXiv preprint arXiv:2403.09766, 2024 - arxiv.org
Unlike traditional task-specific vision models, recent large VLMs can readily adapt to
different vision tasks by simply using different textual instructions, i.e., prompts. However, a …

Learning to rematch mismatched pairs for robust cross-modal retrieval

H Han, Q Zheng, G Dai, M Luo… - Proceedings of the …, 2024 - openaccess.thecvf.com
Collecting well-matched multimedia datasets is crucial for training cross-modal retrieval
models. However, in real-world scenarios, massive multimodal data are harvested from the …

Are deep neural networks SMARTer than second graders?

A Cherian, KC Peng, S Lohit… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recent times have witnessed an increasing number of applications of deep neural networks
toward solving tasks that require superior cognitive abilities, e.g., playing Go, generating art …