Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods
Generating an image/video caption has always been a fundamental problem of Artificial
Intelligence, which is usually performed using the potential of Deep Learning Methods …
X-Mesh: Towards fast and accurate text-driven 3D stylization via dynamic textual guidance
Text-driven 3D stylization is a complex and crucial task in the fields of computer vision (CV)
and computer graphics (CG), aimed at transforming a bare mesh to fit a target text. Prior …
Rotated multi-scale interaction network for referring remote sensing image segmentation
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that
combines computer vision and natural language processing. Traditional Referring Image …
Cross-modality perturbation synergy attack for person re-identification
In recent years, there has been significant research focusing on addressing security
concerns in single-modal person re-identification (ReID) systems that are based on RGB …
Underwater image captioning: Challenges, models, and datasets
We delve into the nascent field of underwater image captioning from three perspectives:
challenges, models, and datasets. One challenge arises from the disparities between …
3D-GRES: Generalized 3D referring expression segmentation
3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific
instance within a 3D space based on a natural language description. However, current …
Vision-language pre-training via modal interaction
Existing vision-language pre-training models typically extract region features and conduct
fine-grained local alignment based on masked image/text completion or object detection …
M3ixup: A multi-modal data augmentation approach for image captioning
Despite the great success, most models in image captioning (IC) are still stuck in the
dilemma of generating simple and non-discriminative captions. In this paper, we study this …
An ensemble model with attention based mechanism for image captioning
Image captioning creates informative text from an input image by creating a relationship
between the words and the actual content of an image. Recently, deep learning models that …
ICEAP: An advanced fine-grained image captioning network with enhanced attribute predictor
Fine-grained image captioning is a focal point in the vision-to-language task and has
attracted considerable attention for generating accurate and contextually relevant image …