VLP: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

X-LLM: Bootstrapping advanced large language models by treating multi-modalities as foreign languages

F Chen, M Han, H Zhao, Q Zhang, J Shi, S Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have demonstrated remarkable language abilities. GPT-4,
based on advanced LLMs, exhibits extraordinary multimodal capabilities beyond previous …

GoG: Relation-aware graph-over-graph network for visual dialog

F Chen, X Chen, F Meng, P Li, J Zhou - arXiv preprint arXiv:2109.08475, 2021 - arxiv.org
Visual dialog, which aims to hold a meaningful conversation with humans about a given
image, is a challenging task that requires models to reason the complex dependencies …

Improving cross-modal understanding in visual dialog via contrastive learning

F Chen, X Chen, S Xu, B Xu - ICASSP 2022-2022 IEEE …, 2022 - ieeexplore.ieee.org
Visual Dialog is a challenging vision-language task since the visual dialog agent needs to
answer a series of questions after reasoning over both the image content and dialog history …

The dialog must go on: Improving visual dialog via generative self-training

GC Kang, S Kim, JH Kim, D Kwak… - Proceedings of the …, 2023 - openaccess.thecvf.com
Visual dialog (VisDial) is a task of answering a sequence of questions grounded in an
image, using the dialog history as context. Prior work has trained the dialog agents solely on …

KBGN: Knowledge-bridge graph network for adaptive vision-text reasoning in visual dialogue

X Jiang, S Du, Z Qin, Y Sun, J Yu - Proceedings of the 28th ACM …, 2020 - dl.acm.org
Visual dialogue is a challenging task that needs to extract implicit information from both
visual (image) and textual (dialogue history) contexts. Classical approaches pay more …

Reasoning with multi-structure commonsense knowledge in visual dialog

S Zhang, X Jiang, Z Yang, T Wan… - Proceedings of the …, 2022 - openaccess.thecvf.com
Visual Dialog requires an agent to engage in a conversation with humans grounded in an
image. Many studies on Visual Dialog focus on the understanding of the dialog history or the …

Unsupervised and pseudo-supervised vision-language alignment in visual dialog

F Chen, D Zhang, X Chen, J Shi, S Xu… - Proceedings of the 30th …, 2022 - dl.acm.org
Visual dialog requires models to give reasonable answers according to a series of coherent
questions and related visual concepts in images. However, most current work either focuses …

HVLM: Exploring human-like visual cognition and language-memory network for visual dialog

K Sun, C Guo, H Zhang, Y Li - Information Processing & Management, 2022 - Elsevier
Visual dialog, a visual-language task, enables an AI agent to engage in conversation with
humans grounded in a given image. To generate appropriate answers for a series of …

Learning dual encoding model for adaptive visual understanding in visual dialogue

J Yu, X Jiang, Z Qin, W Zhang, Y Hu… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Different from Visual Question Answering task that requires to answer only one question
about an image, Visual Dialogue task involves multiple rounds of dialogues which cover a …