X-LLM: Bootstrapping advanced large language models by treating multi-modalities as foreign languages
Large language models (LLMs) have demonstrated remarkable language abilities. GPT-4,
based on advanced LLMs, exhibits extraordinary multimodal capabilities beyond previous …
Towards top-down reasoning: An explainable multi-agent approach for visual question answering
Recently, several methods have been proposed to augment large Vision Language Models
(VLMs) for Visual Question Answering (VQA) simply by incorporating external knowledge …
VisDiaHalBench: A visual dialogue benchmark for diagnosing hallucination in large vision-language models
Despite the significant success of large vision-language models (LVLMs), some studies
have revealed that LVLMs suffer from the hallucination problem, where the LVLMs' response …
ZRIGF: An innovative multimodal framework for zero-resource image-grounded dialogue generation
Image-grounded dialogue systems benefit greatly from integrating visual information,
resulting in high-quality response generation. However, current models struggle to …
CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos
Visual information is central to conversation: body gestures and physical behaviour, for
example, contribute to meaning that transcends words alone. To date, however, most neural …
Structure-Aware Multimodal Sequential Learning for Visual Dialog
YJ Kim, MJ Kim, K An, J Ahn, J Kim, YJ Heo… - Proceedings of the …, 2024 - ojs.aaai.org
With the ability to collect vast amounts of image and natural language data from the web,
there has been a remarkable advancement in Large-scale Language Models (LLMs). This …
A fine-grained deconfounding study for knowledge-based visual dialog
AA Liu, Q Wu, C Huang, C Xue, X Liu, N Xu - Visual Informatics, 2024 - Elsevier
Knowledge-based Visual Dialog is a challenging vision-language task, where an
agent engages in dialog to answer questions with humans based on the input image and …
FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval
J **e, J Kuang, Z Lin, J Ouyang, Z Zhao… - arxiv preprint arxiv …, 2024 - arxiv.org
Given a query from one modality, few-shot cross-modal retrieval (CMR) retrieves
semantically similar instances in another modality with the target domain including classes …
Share What You Already Know: Cross-Language-Script Transfer and Alignment for Sentiment Detection in Code-Mixed Data
Code-switching entails mixing multiple languages. It is an increasingly occurring
phenomenon in social media texts. Usually, code-mixed texts are written in a single script …
Multi-round dialogue state tracking by object-entity alignment in visual dialog
W Pang - CAAI International Conference on Artificial Intelligence, 2023 - Springer
Visual Dialog (VD) is a task where an agent answers a series of image-related questions
based on a multi-round dialog history. However, previous VD methods often treat the entire …