Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

Evaluation of text generation: A survey

A Celikyilmaz, E Clark, J Gao - arXiv preprint arXiv:2006.14799, 2020 - arxiv.org
The paper surveys evaluation methods of natural language generation (NLG) systems that
have been developed in the last few years. We group NLG evaluation methods into three …

DIP: Dual incongruity perceiving network for sarcasm detection

C Wen, G Jia, J Yang - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Sarcasm indicates the literal meaning is contrary to the real attitude. Considering the
popularity and complementarity of image-text data, we investigate the task of multi-modal …

A systematic literature review on image captioning

R Staniūtė, D Šešok - Applied Sciences, 2019 - mdpi.com
Natural language problems have already been investigated for around five years. Recent
progress in artificial intelligence (AI) has greatly improved the performance of models …

Language models can see: Plugging visual controls in text generation

Y Su, T Lan, Y Liu, F Liu, D Yogatama, Y Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Generative language models (LMs) such as GPT-2/3 can be prompted to generate text with
remarkable quality. While they are designed for text-prompted generation, it remains an …

Zero-shot video object segmentation with co-attention siamese networks

X Lu, W Wang, J Shen, D Crandall… - IEEE transactions on …, 2020 - ieeexplore.ieee.org
We introduce a novel network, called CO-attention siamese network (COSNet), to address
the zero-shot video object segmentation task in a holistic fashion. We exploit the inherent …

GPT-4V(ision) as a social media analysis engine

H Lyu, J Huang, D Zhang, Y Yu, X Mou, J Pan… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent research has offered insights into the extraordinary capabilities of Large Multimodal
Models (LMMs) in various general vision and language tasks. There is growing interest in …

Emotional video captioning with vision-based emotion interpretation network

P Song, D Guo, X Yang, S Tang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Effectively summarizing and re-expressing video content by natural languages in a more
human-like fashion is one of the key topics in the field of multimedia content understanding …

Style-aware contrastive learning for multi-style image captioning

Y Zhou, G Long - arXiv preprint arXiv:2301.11367, 2023 - arxiv.org
Existing multi-style image captioning methods show promising results in generating a
caption with accurate visual content and desired linguistic style. However, existing methods …

Human-like controllable image captioning with verb-specific semantic roles

L Chen, Z Jiang, J Xiao, W Liu - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Controllable Image Captioning (CIC)--generating image descriptions following
designated control signals--has received unprecedented attention over the last few years …