Turnitin
降AI改写
早检测系统
早降重系统
Turnitin-UK版
万方检测-期刊版
维普编辑部版
Grammarly检测
Paperpass检测
checkpass检测
PaperYY检测
Multimodal machine learning: A survey and taxonomy
Our experience of the world is multimodal-we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …
odors, and taste flavors. Modality refers to the way in which something happens or is …
Large-scale multi-modal pre-trained models: A comprehensive survey
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …
Evaluating object hallucination in large vision-language models
Inspired by the superior language abilities of large language models (LLM), large vision-
language models (LVLM) have been recently explored by integrating powerful LLMs for …
language models (LVLM) have been recently explored by integrating powerful LLMs for …
Learning transferable visual models from natural language supervision
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined
object categories. This restricted form of supervision limits their generality and usability since …
object categories. This restricted form of supervision limits their generality and usability since …
Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
We present a new dataset of image caption annotations, Conceptual Captions, which
contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) …
contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) …
Visual genome: Connecting language and vision using crowdsourced dense image annotations
Despite progress in perceptual tasks such as image classification, computers still perform
poorly on cognitive tasks such as image description and question answering. Cognition is …
poorly on cognitive tasks such as image description and question answering. Cognition is …
Supervised learning of universal sentence representations from natural language inference data
Many modern NLP systems rely on word embeddings, previously trained in an unsupervised
manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of …
manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of …
Vqa: Visual question answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given
an image and a natural language question about the image, the task is to provide an …
an image and a natural language question about the image, the task is to provide an …
Show, attend and tell: Neural image caption generation with visual attention
Inspired by recent work in machine translation and object detection, we introduce an
attention based model that automatically learns to describe the content of images. We …
attention based model that automatically learns to describe the content of images. We …
Long-term recurrent convolutional networks for visual recognition and description
Abstract Models comprised of deep convolutional network layers have dominated recent
image interpretation tasks; we investigate whether models which are also compositional, or" …
image interpretation tasks; we investigate whether models which are also compositional, or" …