From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

GPT-4V(ision) is a human-aligned evaluator for text-to-3D generation

T Wu, G Yang, Z Li, K Zhang, Z Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Despite recent advances in text-to-3D generative methods there is a notable absence of
reliable evaluation metrics. Existing metrics usually focus on a single criterion each such as …

GPT4Point: A unified framework for point-language understanding and generation

Z Qi, Y Fang, Z Sun, X Wu, T Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal Large Language Models (MLLMs) have excelled in 2D image-text
comprehension and image generation but their understanding of the 3D world is notably …

The Neglected Tails in Vision-Language Models

S Parashar, Z Lin, T Liu, X Dong, Y Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-language models (VLMs) excel in zero-shot recognition but their performance varies
greatly across different visual concepts. For example, although CLIP achieves impressive …

BenchLMM: Benchmarking cross-style visual capability of large multimodal models

R Cai, Z Song, D Guan, Z Chen, Y Li, X Luo… - … on Computer Vision, 2024 - Springer
Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown
remarkable capabilities in visual reasoning on data in common image styles. However, their …

High-order interaction learning for image captioning

Y Wang, N Xu, AA Liu, W Li… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Image captioning aims at understanding various semantic concepts (e.g., objects and
relationships) from an image and integrating them in a sentence-level description. Hence, it …

One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications

M Lyu, Y Yang, H Hong, H Chen, X **… - Proceedings of the …, 2024 - openaccess.thecvf.com
The prevalent use of commercial and open-source diffusion models (DMs) for text-to-image
generation prompts risk mitigation to prevent undesired behaviors. Existing concept erasing …

Deep image captioning: A review of methods, trends and future challenges

L Xu, Q Tang, J Lv, B Zheng, X Zeng, W Li - Neurocomputing, 2023 - Elsevier
Image captioning, also called report generation in medical field, aims to describe visual
content of images in human language, which requires to model semantic relationship …

Visuals to text: A comprehensive review on automatic image captioning

Y Ming, N Hu, C Fan, F Feng… - IEEE/CAA Journal of …, 2022 - researchportal.port.ac.uk
Image captioning refers to automatic generation of descriptive texts according to the visual
content of images. It is a technique integrating multiple disciplines including the computer …