MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Y Yao, T Yu, A Zhang, C Wang, J Cui, H Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally
reshaped the landscape of AI research and industry, shedding light on a promising path …

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Y Goyal, T Khot, D Summers-Stay… - Proceedings of the …, 2017 - openaccess.thecvf.com
Problems at the intersection of vision and language are of significant importance both as
challenging research questions and for the rich set of applications they enable. However …

LRTA: A transparent neural-symbolic reasoning framework with modular supervision for visual question answering

W Liang, F Niu, A Reganti, G Thattai, G Tur - arXiv preprint arXiv …, 2020 - arxiv.org
The predominant approach to visual question answering (VQA) relies on encoding the
image and question with a "black-box" neural encoder and decoding a single token as the …

COCO is “ALL” You Need for Visual Instruction Fine-tuning

X Han, Y Wang, B Zhai, Q You… - 2024 IEEE International …, 2024 - ieeexplore.ieee.org
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of
artificial intelligence. Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' …

S-VQA: Sentence-Based Visual Question Answering

S Pathak, G Singh, A Anand, P Guha - Proceedings of the Fourteenth …, 2023 - dl.acm.org
A Visual Question Answering (VQA) system responds to a natural language question in the
context of an image. This problem has been primarily formulated as a classification problem …

Customized image narrative generation via interactive visual question generation and answering

A Shin, Y Ushiku, T Harada - Proceedings of the IEEE …, 2018 - openaccess.thecvf.com
The image description task has invariably been examined in a static manner, with qualitative
presumptions held to be universally applicable, regardless of the scope or target of the …

StackOverflowVQA: Stack Overflow Visual Question Answering Dataset

M Mirzaei, MJ Pirhadi, S Eetemadi - arXiv preprint arXiv:2405.10736, 2024 - arxiv.org
In recent years, people have increasingly used AI to help them with their problems by asking
questions on different topics. One of these topics can be software-related and programming …

Efficient GPT-4V Level Multimodal Large Language Model for Deployment on Edge Devices

Y Yao, T Yu, A Zhang, C Wang, J Cui, H Zhu, T Cai… - 2025 - researchsquare.com
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally
reshaped the landscape of AI research and industry, shedding light on a promising path …

Multimodal Learning for Accurate Visual Question Answering: An Attention-based Approach

J Bhardwaj, A Balakrishnan, S Pathak… - Proceedings of the …, 2023 - aclanthology.org
This paper proposes an open-ended task for Visual Question Answering (VQA) that
leverages the InceptionV3 Object Detection model and an attention-based Long Short-Term …

Generate Answer to Visual Questions with Pre-trained Vision-and-Language Embeddings

H Sheikhi, M Hashemi, S Eetemadi - WiNLP Workshop at EMNLP, 2022 - karlancer.com
Visual Question Answering is a multi-modal task under the consideration of both the
Vision and Language communities. Present VQA models are limited to classification …