Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
From show to tell: A survey on deep learning-based image captioning
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …
TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts
Inspired by the strong ties between vision and language, the two intimate human sensing
and communication modalities, our paper aims to explore the generation of 3D human full …
Multi-modal knowledge graph construction and application: A survey
Recent years have witnessed the resurgence of knowledge engineering which is featured
by the fast growth of knowledge graphs. However, most of existing knowledge graphs are …
RSTNet: Captioning with adaptive attention on visual and non-visual words
Recent progress on visual question answering has explored the merits of grid features for
vision language tasks. Meanwhile, transformer-based models have shown remarkable …
Attention on attention for image captioning
Attention mechanisms are widely used in current encoder/decoder frameworks of image
captioning, where a weighted average on encoded vectors is generated at each time step to …
VideoBERT: A joint model for video and language representation learning
Self-supervised learning has become increasingly important to leverage the abundance of
unlabeled data available on platforms like YouTube. Whereas most existing approaches …
Auto-encoding scene graphs for image captioning
We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language
inductive bias into the encoder-decoder image captioning framework for more human-like …
A comprehensive survey of deep learning for image captioning
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …
VisualGPT: Data-efficient adaptation of pretrained language models for image captioning
The limited availability of annotated data often hinders real-world applications of machine
learning. To efficiently learn from small quantities of multimodal data, we leverage the …