From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e. describing images …

Vision Transformers in medical computer vision—A contemplative retrospection

A Parvaiz, MA Khalid, R Zafar, H Ameer, M Ali… - … Applications of Artificial …, 2023 - Elsevier
Vision Transformers (ViTs), with the magnificent potential to unravel the information
contained within images, have evolved as one of the most contemporary and dominant …

The dawn of LMMs: Preliminary explorations with GPT-4V(ision)

Z Yang, L Li, K Lin, J Wang, CC Lin… - arXiv preprint arXiv …, 2023 - stableaiprompts.com
Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory
skills, such as visual understanding, to achieve stronger generic intelligence. In this paper …

Dense text-to-image generation with attention modulation

Y Kim, J Lee, JH Kim, JW Ha… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Existing text-to-image diffusion models struggle to synthesize realistic images given dense
captions, where each text prompt provides a detailed description for a specific image region …

Interactive and explainable region-guided radiology report generation

T Tanida, P Müller, G Kaissis… - Proceedings of the …, 2023 - openaccess.thecvf.com
The automatic generation of radiology reports has the potential to assist radiologists in the
time-consuming task of report writing. Existing methods generate the full report from image …

High-precision multiclass classification of lung disease through customized MobileNetV2 from chest X-ray images

FMJM Shamrat, S Azam, A Karim, K Ahmed… - Computers in Biology …, 2023 - Elsevier
In this study, multiple lung diseases are diagnosed with the help of the Neural Network
algorithm. Specifically, Emphysema, Infiltration, Mass, Pleural Thickening, Pneumonia …

GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition

SC Huang, L Shen, MP Lungren… - Proceedings of the …, 2021 - openaccess.thecvf.com
In recent years, the growing number of medical imaging studies is placing an ever-
increasing burden on radiologists. Deep learning provides a promising solution for …

Expressive text-to-image generation with rich text

S Ge, T Park, JY Zhu, JB Huang - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Plain text has become a prevalent interface for text-to-image synthesis. However, its limited
customization options hinder users from accurately describing desired outputs. For example …

Pre-trained models: Past, present and future

X Han, Z Zhang, N Ding, Y Gu, X Liu, Y Huo, J Qiu… - AI Open, 2021 - Elsevier
Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved
great success and become a milestone in the field of artificial intelligence (AI). Owing to …

GRiT: A generative region-to-text transformer for object understanding

J Wu, J Wang, Z Yang, Z Gan, Z Liu, J Yuan… - European Conference on …, 2024 - Springer
This paper presents a Generative RegIon-to-Text transformer, GRiT, for object
understanding. The spirit of GRiT is to formulate object understanding as <region, text> …