Deep image captioning: A review of methods, trends and future challenges

L Xu, Q Tang, J Lv, B Zheng, X Zeng, W Li - Neurocomputing, 2023 - Elsevier
Image captioning, also called report generation in medical field, aims to describe visual
content of images in human language, which requires to model semantic relationship …

Ask, Attend, Attack: A Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models

Q Zeng, Z Wang, Y Cheung, M Jiang - arXiv preprint arXiv:2408.08989, 2024 - arxiv.org
While image-to-text models have demonstrated significant advancements in various vision-
language tasks, they remain susceptible to adversarial attacks. Existing white-box attacks on …

Exploring Visual Relationships via Transformer-based Graphs for Enhanced Image Captioning

J Li, Z Mao, H Li, W Chen, Y Zhang - ACM Transactions on Multimedia …, 2024 - dl.acm.org
Image captioning (IC), bringing vision to language, has drawn extensive attention. A crucial
aspect of IC is the accurate depiction of visual relations among image objects. Visual …

NumCap: a number-controlled multi-caption image captioning network

A Abdussalam, Z Ye, A Hawbani, M Al-Qatf… - ACM Transactions on …, 2023 - dl.acm.org
Image captioning is a promising task that attracted researchers in the last few years. Existing
image captioning models are primarily trained to generate one caption per image. However …

Multi-scale motivated neural network for image-text matching

X Qin, L Li, G Pang - Multimedia Tools and Applications, 2024 - Springer
Existing mainstream image-text matching methods usually measure the relevance of image-
text pairs by capturing and aggregating the affinities between textual words and visual …

Video captioning by learning from global sentence and looking ahead

TZ Niu, ZD Chen, X Luo, PF Zhang, Z Huang… - ACM Transactions on …, 2023 - dl.acm.org
Video captioning aims to automatically generate natural language sentences describing the
content of a video. Although encoder-decoder-based models have achieved promising …

Semantic enhanced video captioning with multi-feature fusion

TZ Niu, SS Dong, ZD Chen, X Luo, S Guo… - ACM Transactions on …, 2023 - dl.acm.org
Video captioning aims to automatically describe a video clip with informative sentences. At
present, deep learning-based models have become the mainstream for this task and …

A2SC: Adversarial Attacks on Subspace Clustering

Y Xu, X Wei, P Dai, X Cao - ACM Transactions on Multimedia Computing …, 2023 - dl.acm.org
Many studies demonstrate that supervised learning techniques are vulnerable to adversarial
examples. However, adversarial threats in unsupervised learning have not drawn sufficient …

Zero-shot scene graph generation via triplet calibration and reduction

J Li, Y Wang, W Li - ACM Transactions on Multimedia Computing …, 2023 - dl.acm.org
Scene Graph Generation (SGG) plays a pivotal role in downstream vision-language tasks.
Existing SGG methods typically suffer from poor compositional generalizations on unseen …

Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval

T Yao, S Peng, L Wang, Y Li, Y Sun - Applied Intelligence, 2024 - Springer
Recent days have seen significant improvements in multi-modal learning made by Vision-
Language Pre-training (VLP) models. However, most of them employ the coarse-grained …