Enhancing large vision language models with self-training on image comprehension

Y Deng, P Lu, F Yin, Z Hu, S Shen, Q Gu, J Zou… - arXiv preprint …
GC Kang, J Kim, J Kim, BT Zhang - 2024 IEEE International …, 2024 - ieeexplore.ieee.org
Interactive Object Grasping (IOG) is the task of identifying and grasping the desired object
via human-robot natural language interaction. Current IOG systems assume that a human …

Enabling harmonious human-machine interaction with visual-context augmented dialogue system: A review

H Wang, B Guo, Y Zeng, M Chen, Y Ding… - ACM Transactions on …, 2022 - dl.acm.org
The intelligent dialogue system, aiming at communicating with humans harmoniously with
natural language, is brilliant for promoting the advancement of human-machine interaction …

Retrieval across any domains via large-scale pre-trained model

J Yan, Z Yin, C Xu, C Deng, H Huang - Forty-first International …, 2024 - openreview.net
In order to enhance the generalization ability towards unseen domains, universal cross-
domain image retrieval methods require a training dataset encompassing diverse domains …

VD-GR: boosting visual dialog with cascaded spatial-temporal multi-modal graphs

A Abdessaied, L Shi, A Bulling - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
We propose VD-GR--a novel visual dialog model that combines pre-trained language
models (LMs) with graph neural networks (GNNs). Prior works mainly focused on one class …

InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models

B Wen, Z Yang, J Wang, Z Gan, B Howe… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich
informative answers in each round even with external knowledge related to the visual …

Synthesizing Sentiment-Controlled Feedback For Multimodal Text and Image Data

P Kumar, S Malik, B Raman, X Li - arXiv preprint arXiv:2402.07640, 2024 - arxiv.org
The ability to generate sentiment-controlled feedback in response to multimodal inputs
comprising text and images addresses a critical gap in human-computer interaction. This …