Knowledge editing for large language models: A survey
Large Language Models (LLMs) have recently transformed both the academic and industrial
landscapes due to their remarkable capacity to understand, analyze, and generate texts …
A complete survey on generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 all you need?
As ChatGPT goes viral, generative AI (AIGC, aka AI-generated content) has made headlines
everywhere because of its ability to analyze and create text, images, and beyond. With such …
Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
ClipCap: CLIP prefix for image captioning
Image captioning is a fundamental task in vision-language understanding, where the model
predicts a textual informative caption to a given input image. In this paper, we present a …
Scaling up vision-language pre-training for image captioning
In recent years, we have witnessed significant performance boost in the image captioning
task based on vision-language pre-training (VLP). Scale is believed to be an important factor …
From show to tell: A survey on deep learning-based image captioning
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …
AdaBins: Depth estimation using adaptive bins
We address the problem of estimating a high quality dense depth map from a single RGB
input image. We start out with a baseline encoder-decoder convolutional neural network …
Meshed-memory transformer for image captioning
Transformer-based architectures represent the state of the art in sequence modeling tasks
like machine translation and language understanding. Their applicability to multi-modal …
Dual-level collaborative transformer for image captioning
Descriptive region features extracted by object detection networks have played an important
role in the recent advancements of image captioning. However, they are still criticized for the …
RSTNet: Captioning with adaptive attention on visual and non-visual words
Recent progress on visual question answering has explored the merits of grid features for
vision language tasks. Meanwhile, transformer-based models have shown remarkable …