X-LLM: Bootstrapping advanced large language models by treating multi-modalities as foreign languages

F Chen, M Han, H Zhao, Q Zhang, J Shi, S Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have demonstrated remarkable language abilities. GPT-4,
based on advanced LLMs, exhibits extraordinary multimodal capabilities beyond previous …

Towards top-down reasoning: An explainable multi-agent approach for visual question answering

Z Wang, W Wan, Q Lao, R Chen, M Lang… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, several methods have been proposed to augment large Vision Language Models
(VLMs) for Visual Question Answering (VQA) simplicity by incorporating external knowledge …

VisDiaHalBench: A visual dialogue benchmark for diagnosing hallucination in large vision-language models

Q Cao, J Cheng, X Liang, L Lin - … of the 62nd Annual Meeting of …, 2024 - aclanthology.org
Despite the significant success of large vision-language models (LVLMs), some studies
have revealed that LVLMs suffer from the hallucination problem, where the LVLMs' response …

ZRIGF: An innovative multimodal framework for zero-resource image-grounded dialogue generation

B Zhang, J Wang, H Ma, B Xu, H Lin - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Image-grounded dialogue systems benefit greatly from integrating visual information,
resulting in high-quality response generation. However, current models struggle to …

CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos

S Han, J Hessel, N Dziri, Y Choi… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Visual information is central to conversation: body gestures and physical behaviour, for
example, contribute to meaning that transcends words alone. To date, however, most neural …

Structure-Aware Multimodal Sequential Learning for Visual Dialog

YJ Kim, MJ Kim, K An, J Ahn, J Kim, YJ Heo… - Proceedings of the …, 2024 - ojs.aaai.org
With the ability to collect vast amounts of image and natural language data from the web,
there has been a remarkable advancement in Large-scale Language Models (LLMs). This …

A fine-grained deconfounding study for knowledge-based visual dialog

AA Liu, Q Wu, C Huang, C Xue, X Liu, N Xu - Visual Informatics, 2024 - Elsevier
Abstract Knowledge-based Visual Dialog is a challenging vision-language task, where an
Knowledge-based Visual Dialog is a challenging vision-language task, where an
agent engages in dialog to answer questions with humans based on the input image and …

FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval

J **e, J Kuang, Z Lin, J Ouyang, Z Zhao… - arxiv preprint arxiv …, 2024 - arxiv.org
Given a query from one modality, few-shot cross-modal retrieval (CMR) retrieves
semantically similar instances in another modality with the target domain including classes …

Share What You Already Know: Cross-Language-Script Transfer and Alignment for Sentiment Detection in Code-Mixed Data

N Pahari, K Shimada - ACM Transactions on Asian and Low-Resource …, 2024 - dl.acm.org
Code-switching entails mixing multiple languages. It is an increasingly occurring
phenomenon in social media texts. Usually, code-mixed texts are written in a single script …

Multi-round dialogue state tracking by object-entity alignment in visual dialog

W Pang - CAAI International Conference on Artificial Intelligence, 2023 - Springer
Visual Dialog (VD) is a task where an agent answers a series of image-related questions
based on a multi-round dialog history. However, previous VD methods often treat the entire …