VLP: A survey on vision-language pre-training
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …
X-LLM: Bootstrapping advanced large language models by treating multi-modalities as foreign languages
Large language models (LLMs) have demonstrated remarkable language abilities. GPT-4,
based on advanced LLMs, exhibits extraordinary multimodal capabilities beyond previous …
GoG: Relation-aware graph-over-graph network for visual dialog
Visual dialog, which aims to hold a meaningful conversation with humans about a given
image, is a challenging task that requires models to reason the complex dependencies …
Improving cross-modal understanding in visual dialog via contrastive learning
Visual Dialog is a challenging vision-language task since the visual dialog agent needs to
answer a series of questions after reasoning over both the image content and dialog history …
The dialog must go on: Improving visual dialog via generative self-training
Visual dialog (VisDial) is a task of answering a sequence of questions grounded in an
image, using the dialog history as context. Prior work has trained the dialog agents solely on …
KBGN: Knowledge-bridge graph network for adaptive vision-text reasoning in visual dialogue
Visual dialogue is a challenging task that needs to extract implicit information from both
visual (image) and textual (dialogue history) contexts. Classical approaches pay more …
Reasoning with multi-structure commonsense knowledge in visual dialog
Visual Dialog requires an agent to engage in a conversation with humans grounded in an
image. Many studies on Visual Dialog focus on the understanding of the dialog history or the …
Unsupervised and pseudo-supervised vision-language alignment in visual dialog
Visual dialog requires models to give reasonable answers according to a series of coherent
questions and related visual concepts in images. However, most current work either focuses …
HVLM: Exploring human-like visual cognition and language-memory network for visual dialog
K Sun, C Guo, H Zhang, Y Li - Information Processing & Management, 2022 - Elsevier
Visual dialog, a visual-language task, enables an AI agent to engage in conversation with
humans grounded in a given image. To generate appropriate answers for a series of …
Learning dual encoding model for adaptive visual understanding in visual dialogue
Different from Visual Question Answering task that requires to answer only one question
about an image, Visual Dialogue task involves multiple rounds of dialogues which cover a …