Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
On evaluating adversarial robustness of large vision-language models
Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented
performance in response generation, especially with visual inputs, enabling more creative …
Call for Papers--The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
We present the call for papers for the BabyLM Challenge: Sample-efficient pretraining on a
developmentally plausible corpus. This shared task is intended for participants with an …
Negative object presence evaluation (NOPE) to measure object hallucination in vision-language models
Object hallucination poses a significant challenge in vision-language (VL) models, often
leading to the generation of nonsensical or unfaithful responses with non-existent objects …
NaturalBench: Evaluating vision-language models on natural adversarial samples
Vision-language models (VLMs) have made significant progress in recent visual-question-
answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However …
MTVQA: Benchmarking multilingual text-centric visual question answering
Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates
human-machine interaction in text-centric visual environments but also serves as a de facto …
Video question answering: Datasets, algorithms and challenges
Video Question Answering (VideoQA) aims to answer natural language questions according
to the given videos. It has earned increasing attention with recent research trends in joint …
An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models
Different from traditional task-specific vision models, recent large VLMs can readily adapt to
different vision tasks by simply using different textual instructions, i.e., prompts. However, a …
Learning to rematch mismatched pairs for robust cross-modal retrieval
Collecting well-matched multimedia datasets is crucial for training cross-modal retrieval
models. However, in real-world scenarios, massive multimodal data are harvested from the …
Are deep neural networks SMARTer than second graders?
Recent times have witnessed an increasing number of applications of deep neural networks
towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art …