From image to language: A critical analysis of visual question answering (VQA) approaches, challenges, and opportunities

MF Ishmam, MSH Shovon, MF Mridha, N Dey - Information Fusion, 2024 - Elsevier
The multimodal task of Visual Question Answering (VQA), encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …

SaCo loss: Sample-wise affinity consistency for vision-language pre-training

S Wu, H Tan, Z Tian, Y Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-language pre-training (VLP) aims to learn joint representations of vision and
language modalities. The contrastive paradigm is currently dominant in this field. However …

MLLMs-augmented visual-language representation learning

Y Liu, K Wang, W Shao, P Luo, Y Qiao… - arXiv preprint arXiv …, 2023 - arxiv.org
Visual-language pre-training has achieved remarkable success in many multi-modal tasks,
largely attributed to the availability of large-scale image-text datasets. In this work, we …

COSMO: Contrastive streamlined multimodal model with interleaved pre-training

AJ Wang, L Li, KQ Lin, J Wang, K Lin, Z Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
In the evolution of Vision-Language Pre-training, shifting from short-text comprehension to
encompassing extended textual contexts is pivotal. Recent autoregressive vision-language …

MAFA: Managing false negatives for vision-language pre-training

J Byun, D Kim, T Moon - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
We consider a critical issue of false negatives in Vision-Language Pre-training (VLP), a
challenge that arises from the inherent many-to-many correspondence of image-text pairs in …

Data-efficient multimodal fusion on a single GPU

N Vouitsis, Z Liu, SK Gorti… - Proceedings of the …, 2024 - openaccess.thecvf.com
The goal of multimodal alignment is to learn a single latent space that is shared between
multimodal inputs. The most powerful models in this space have been trained using massive …

CLIP-CID: Efficient CLIP distillation via cluster-instance discrimination

K Yang, T Gu, X An, H Jiang, X Dai, Z Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has achieved excellent performance over
a wide range of tasks. However, the effectiveness of CLIP heavily relies on a substantial …

Active data curation effectively distills large-scale multimodal models

V Udandarao, N Parthasarathy, MF Naeem… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into
smaller ones. Prior works have explored ever more complex KD strategies involving different …

Code less, align more: Efficient LLM fine-tuning for code generation with data pruning

YD Tsai, M Liu, H Ren - arXiv preprint arXiv:2407.05040, 2024 - arxiv.org
Recent work targeting large language models (LLMs) for code generation has demonstrated that
increasing the amount of training data through synthetic code generation often leads to …

VG-CALF: A vision-guided cross-attention and late-fusion network for radiology images in Medical Visual Question Answering

A Lameesa, C Silpasuwanchai, MSB Alam - Neurocomputing, 2025 - Elsevier
Image and question matching is essential in Medical Visual Question Answering (MVQA) in
order to accurately assess the visual-semantic correspondence between an image and a …