- Academic Search

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 198 · All 7 versions

[PDF] neurips.cc

On evaluating adversarial robustness of large vision-language models

Y Zhao, T Pang, C Du, X Yang, C Li… - Advances in …, 2023 - proceedings.neurips.cc

Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented
performance in response generation, especially with visual inputs, enabling more creative …

Cited by 194 · All 9 versions

[PDF] arxiv.org

Call for Papers--The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus

A Warstadt, L Choshen, A Mueller, A Williams… - arXiv preprint arXiv …, 2023 - arxiv.org

We present the call for papers for the BabyLM Challenge: Sample-efficient pretraining on a
developmentally plausible corpus. This shared task is intended for participants with an …

Cited by 144 · All 11 versions

[PDF] arxiv.org

Negative object presence evaluation (NOPE) to measure object hallucination in vision-language models

H Lovenia, W Dai, S Cahyawijaya, Z Ji… - arXiv preprint arXiv …, 2023 - arxiv.org

Object hallucination poses a significant challenge in vision-language (VL) models, often
leading to the generation of nonsensical or unfaithful responses with non-existent objects …

Cited by 55 · All 4 versions

[PDF] neurips.cc

NaturalBench: Evaluating vision-language models on natural adversarial samples

B Li, Z Lin, W Peng, JD Nyandwi… - Advances in …, 2025 - proceedings.neurips.cc

Vision-language models (VLMs) have made significant progress in recent visual-question-
answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However …

Cited by 8 · All 3 versions

[PDF] arxiv.org

MTVQA: Benchmarking multilingual text-centric visual question answering

J Tang, Q Liu, Y Ye, J Lu, S Wei, C Lin, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org

Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates
human-machine interaction in text-centric visual environments but also serves as a de facto …

Cited by 30 · All 4 versions

[PDF] arxiv.org

Video question answering: Datasets, algorithms and challenges

Y Zhong, J Xiao, W Ji, Y Li, W Deng… - arXiv preprint arXiv …, 2022 - arxiv.org

Video Question Answering (VideoQA) aims to answer natural language questions according
to the given videos. It has earned increasing attention with recent research trends in joint …

Cited by 100 · All 4 versions

[PDF] arxiv.org

An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models

H Luo, J Gu, F Liu, P Torr - arXiv preprint arXiv:2403.09766, 2024 - arxiv.org

Different from traditional task-specific vision models, recent large VLMs can readily adapt to
different vision tasks by simply using different textual instructions, i.e., prompts. However, a …

Cited by 23 · All 3 versions

[PDF] thecvf.com

Learning to rematch mismatched pairs for robust cross-modal retrieval

H Han, Q Zheng, G Dai, M Luo… - Proceedings of the …, 2024 - openaccess.thecvf.com

Collecting well-matched multimedia datasets is crucial for training cross-modal retrieval
models. However, in real-world scenarios, massive multimodal data are harvested from the …

Cited by 7 · All 7 versions

[PDF] thecvf.com

Are deep neural networks SMARTer than second graders?

A Cherian, KC Peng, S Lohit… - Proceedings of the …, 2023 - openaccess.thecvf.com

Recent times have witnessed an increasing number of applications of deep neural networks
towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art …

Cited by 31 · All 9 versions
