Deep Multimodal Data Fusion

F Zhao, C Zhang, B Geng - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data
(e.g., images, texts, or data collected from different sensors), feature engineering (e.g. …

Graph transformers: A survey

A Shehzad, F Xia, S Abid, C Peng, S Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Graph transformers are a recent advancement in machine learning, offering a new class of
neural network models for graph-structured data. The synergy between transformers and …

Sentinel mechanism for visual semantic graph-based image captioning

F Xiao, N Zhang, W Xue, X Gao - Computers and Electrical Engineering, 2024 - Elsevier
Image captioning aims to generate a description of a given image. However, inherent
representation differences between images and sentences make it difficult to align semantic …

Divide and Conquer: Isolating Normal-Abnormal Attributes in Knowledge Graph-Enhanced Radiology Report Generation

X Liang, Y Zhang, D Wang, H Zhong, R Li… - Proceedings of the 32nd …, 2024 - dl.acm.org
Radiology report generation aims to automatically generate clinical descriptions for
radiology images, reducing the workload of radiologists. Compared to general image …

Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos

D Verma, D Roy, B Fernando - arXiv preprint arXiv:2407.20642, 2024 - arxiv.org
Situation recognition refers to the ability of an agent to identify and understand various
situations or contexts based on available information and sensory inputs. It involves the …

RefCap: image captioning with referent objects attributes

S Park, J Paik - Scientific Reports, 2023 - nature.com
In recent years, significant progress has been made in visual-linguistic multi-modality
research, leading to advancements in visual comprehension and its applications in …

A multi-view projection-based object-aware graph network for dense captioning of point clouds

Z Ma, Z Yang, A Mao, S Wen, R Yi, Y Liu - Computers & Graphics, 2025 - Elsevier
3D dense captioning has received increasing attention in the multimodal field of 3D
vision and language. This task aims to generate a specific descriptive sentence for each …

Eye-movement-prompted large image captioning model

Z Yang, B Han, X Gao, ZH Zhan - Pattern Recognition, 2025 - Elsevier
Pretrained large vision-language models have shown outstanding performance on the task
of image captioning. However, owing to the insufficient decoding of image features, existing …

EdgeScan for IoT Contextual Understanding With Edge Computing and Image Captioning

DA Hafeth, M Al-khafajiy… - IEEE Internet of Things …, 2024 - ieeexplore.ieee.org
The emergence of Edge Computing has shifted the processing capabilities in proximity to
the Internet of Things data sources, offering solutions to latency and bandwidth constraints …

Mining informativeness in scene graphs: Prioritizing informative relations in Scene Graph Generation for enhanced performance in applications

M Neau, PE Santos, AG Bosser, A Macvicar… - Pattern Recognition …, 2025 - Elsevier
Learning to compose visual relationships from raw images in the form of scene graphs is a
highly challenging Computer Vision task, yet it is essential for applications related to scene …