Visual question answering: A survey of methods and datasets
Visual Question Answering (VQA) is a challenging task that has received increasing
attention from both the computer vision and the natural language processing communities …
From image to language: A critical analysis of visual question answering (VQA) approaches, challenges, and opportunities
The multimodal task of Visual Question Answering (VQA), encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …
Evaluating object hallucination in large vision-language models
Inspired by the superior language abilities of large language models (LLMs), large vision-
language models (LVLMs) have recently been explored by integrating powerful LLMs for …
3D-LLM: Injecting the 3D world into large language models
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to
excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be …
Learn to explain: Multimodal reasoning via thought chains for science question answering
When answering a question, humans utilize the information available across different
modalities to synthesize a consistent and complete chain of thought (CoT). This process is …
Learning to answer questions in dynamic audio-visual scenarios
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to
answer questions regarding different visual objects, sounds, and their associations in …
Unified vision-language pre-training for image captioning and VQA
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is
unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image …
Counterfactual VQA: A cause-effect look at language bias
Recent VQA models tend to rely on language bias as a shortcut and thus fail to
sufficiently learn the multi-modal knowledge from both vision and language. In this paper …
GQA: A new dataset for real-world visual reasoning and compositional question answering
We introduce GQA, a new dataset for real-world visual reasoning and compositional
question answering, seeking to address key shortcomings of previous VQA datasets. We …
What matters in training a GPT4-style language model with multimodal inputs?
Recent advancements in GPT-4V have displayed remarkable multi-modal capabilities in
processing image inputs and following open-ended instructions. Despite these …