Deep image captioning: A review of methods, trends and future challenges
Image captioning, also called report generation in medical field, aims to describe visual
content of images in human language, which requires to model semantic relationship …
Ask, Attend, Attack: A Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models
While image-to-text models have demonstrated significant advancements in various vision-
language tasks, they remain susceptible to adversarial attacks. Existing white-box attacks on …
Exploring Visual Relationships via Transformer-based Graphs for Enhanced Image Captioning
Image captioning (IC), bringing vision to language, has drawn extensive attention. A crucial
aspect of IC is the accurate depiction of visual relations among image objects. Visual …
NumCap: a number-controlled multi-caption image captioning network
Image captioning is a promising task that attracted researchers in the last few years. Existing
image captioning models are primarily trained to generate one caption per image. However …
Multi-scale motivated neural network for image-text matching
X Qin, L Li, G Pang - Multimedia Tools and Applications, 2024 - Springer
Existing mainstream image-text matching methods usually measure the relevance of image-
text pairs by capturing and aggregating the affinities between textual words and visual …
Video captioning by learning from global sentence and looking ahead
Video captioning aims to automatically generate natural language sentences describing the
content of a video. Although encoder-decoder-based models have achieved promising …
Semantic enhanced video captioning with multi-feature fusion
Video captioning aims to automatically describe a video clip with informative sentences. At
present, deep learning-based models have become the mainstream for this task and …
A2SC: Adversarial Attacks on Subspace Clustering
Many studies demonstrate that supervised learning techniques are vulnerable to adversarial
examples. However, adversarial threats in unsupervised learning have not drawn sufficient …
Zero-shot scene graph generation via triplet calibration and reduction
J Li, Y Wang, W Li - ACM Transactions on Multimedia Computing …, 2023 - dl.acm.org
Scene Graph Generation (SGG) plays a pivotal role in downstream vision-language tasks.
Existing SGG methods typically suffer from poor compositional generalizations on unseen …
Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval
T Yao, S Peng, L Wang, Y Li, Y Sun - Applied Intelligence, 2024 - Springer
Recent days have seen significant improvements in multi-modal learning made by Vision-
Language Pre-training (VLP) models. However, most of them employ the coarse-grained …