From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Zerocap: Zero-shot image-to-text generation for visual-semantic arithmetic

Y Tewel, Y Shalev, I Schwartz… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Recent text-to-image matching models apply contrastive learning to large corpora of
uncurated pairs of images and sentences. While such models can provide a powerful score …

Cross-modal text and visual generation: A systematic review. Part 1: Image to text

M Żelaszczyk, J Mańdziuk - Information Fusion, 2023 - Elsevier
We review the existing literature on generating text from visual data under the cross-modal
generation umbrella, which affords us to compare and contrast various approaches taking …

Language models can see: Plugging visual controls in text generation

Y Su, T Lan, Y Liu, F Liu, D Yogatama, Y Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Generative language models (LMs) such as GPT-2/3 can be prompted to generate text with
remarkable quality. While they are designed for text-prompted generation, it remains an …

Using AI and social media multimodal content for disaster response and management: Opportunities, challenges, and future directions

M Imran, F Ofli, D Caragea, A Torralba - Information Processing & …, 2020 - Elsevier
People increasingly use Social Media (SM) platforms such as Twitter and Facebook
during disasters and emergencies to post situational updates including reports of injured or …

Trends in integration of vision and language research: A survey of tasks, datasets, and methods

A Mogadala, M Kalimuthu, D Klakow - Journal of Artificial Intelligence …, 2021 - jair.org
Interest in Artificial Intelligence (AI) and its applications has seen unprecedented
growth in the last few years. This success can be partly attributed to the advancements made …

Context-aware visual policy network for fine-grained image captioning

ZJ Zha, D Liu, H Zhang, Y Zhang… - IEEE transactions on …, 2019 - ieeexplore.ieee.org
With the maturity of visual detection techniques, we are more ambitious in describing visual
content with open-vocabulary, fine-grained and free-form language, i.e., the task of image …

Removing bias in multi-modal classifiers: Regularization by maximizing functional entropies

I Gat, I Schwartz, A Schwing… - Advances in Neural …, 2020 - proceedings.neurips.cc
Many recent datasets contain a variety of different data modalities, for instance, image,
question, and answer data in visual question answering (VQA). When training deep net …

Factor graph attention

I Schwartz, S Yu, T Hazan… - Proceedings of the …, 2019 - openaccess.thecvf.com
Dialog is an effective way to exchange information, but subtle details and nuances are
extremely important. While significant progress has paved a path to address visual dialog …

Zero-shot image-to-text generation for visual-semantic arithmetic

Y Tewel, Y Shalev, I Schwartz, L Wolf - arXiv preprint arXiv …, 2021 - academia.edu
Recent text-to-image matching models apply contrastive learning to large corpora of
uncurated pairs of images and sentences. While such models can provide a powerful score …