Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods

MS Wajid, H Terashima‐Marin, P Najafirad… - Engineering …, 2024 - Wiley Online Library
Generating an image/video caption has long been a fundamental problem in Artificial
Intelligence, usually addressed with deep learning methods …

X-Mesh: Towards fast and accurate text-driven 3D stylization via dynamic textual guidance

Y Ma, X Zhang, X Sun, J Ji, H Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Text-driven 3D stylization is a complex and crucial task in the fields of computer vision (CV)
and computer graphics (CG), aimed at transforming a bare mesh to fit a target text. Prior …

Rotated multi-scale interaction network for referring remote sensing image segmentation

S Liu, Y Ma, X Zhang, H Wang, J Ji… - Proceedings of the …, 2024 - openaccess.thecvf.com
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that
combines computer vision and natural language processing. Traditional Referring Image …

Cross-modality perturbation synergy attack for person re-identification

Y Gong, Z Zhong, Y Qu, Z Luo, R Ji, M Jiang - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, there has been significant research focusing on addressing security
concerns in single-modal person re-identification (ReID) systems that are based on RGB …

Underwater image captioning: Challenges, models, and datasets

H Li, H Wang, Y Zhang, L Li, P Ren - ISPRS Journal of Photogrammetry and …, 2025 - Elsevier
We delve into the nascent field of underwater image captioning from three perspectives:
challenges, models, and datasets. One challenge arises from the disparities between …

3D-GRES: Generalized 3D referring expression segmentation

C Wu, Y Liu, J Ji, Y Ma, H Wang, G Luo… - Proceedings of the …, 2024 - dl.acm.org
3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific
instance within a 3D space based on a natural language description. However, current …

Vision-language pre-training via modal interaction

H Cheng, H Ye, X Zhou, X Liu, F Chen, M Wang - Pattern Recognition, 2024 - Elsevier
Existing vision-language pre-training models typically extract region features and conduct
fine-grained local alignment based on masked image/text completion or object detection …

M3ixup: A multi-modal data augmentation approach for image captioning

Y Li, J Ji, X Sun, Y Zhou, Y Luo, R Ji - Pattern Recognition, 2025 - Elsevier
Despite their great success, most image captioning (IC) models are still stuck in the
dilemma of generating simple, non-discriminative captions. In this paper, we study this …

An ensemble model with attention based mechanism for image captioning

I Al Badarneh, BH Hammo, O Al-Kadi - Computers and Electrical …, 2025 - Elsevier
Image captioning generates informative text from an input image by establishing a relationship
between the words and the actual content of the image. Recently, deep learning models that …

ICEAP: An advanced fine-grained image captioning network with enhanced attribute predictor

MB Hossen, Z Ye, A Abdussalam, MA Hossain - Displays, 2024 - Elsevier
Fine-grained image captioning is a focal point in the vision-to-language task and has
attracted considerable attention for generating accurate and contextually relevant image …