Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Deep learning approaches on image captioning: A review
Image captioning is a research area of immense importance, aiming to generate natural
language descriptions for visual content in the form of still images. The advent of deep …
Multimodal foundation models: From specialists to general-purpose assistants
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …
GIT: A generative image-to-text transformer for vision and language
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …
Coarse-to-fine vision-language pre-training with fusion in the backbone
Vision-language (VL) pre-training has recently received considerable attention.
However, most existing end-to-end pre-training approaches either only aim to tackle VL …
Semantic-conditional diffusion networks for image captioning
Recent advances on text-to-image generation have witnessed the rise of diffusion models
which act as powerful generative models. Nevertheless, it is not trivial to exploit such latent …
MeaCap: Memory-augmented zero-shot image captioning
Zero-shot image captioning (IC) without well-paired image-text data can be categorized into
two main types: training-free and text-only-training methods. While both types integrate pre …
Caption anything: Interactive image description with diverse multimodal controls
Controllable image captioning is an emerging multimodal topic that aims to describe the
image with natural language following human purpose, e.g., looking at the …
Tag2Text: Guiding vision-language model via image tagging
This paper presents Tag2Text, a vision language pre-training (VLP) framework, which
introduces image tagging into vision-language models to guide the learning of visual …
ConZIC: Controllable zero-shot image captioning by sampling-based polishing
Zero-shot capability has been considered as a new revolution of deep learning, letting
machines work on tasks without curated training data. As a good start and the only existing …