A survey of efficient fine-tuning methods for vision-language models—prompt and adapter

J Xing, J Liu, J Wang, L Sun, X Chen, X Gu… - Computers & Graphics, 2024 - Elsevier
Vision Language Model (VLM) is a popular research field located at the fusion of
computer vision and natural language processing (NLP). With the emergence of transformer …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

From pixels to graphs: Open-vocabulary scene graph generation with vision-language models

R Li, S Zhang, D Lin, K Chen… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph
representation for downstream reasoning tasks. Despite recent advancements, existing …

Graph neural networks in vision-language image understanding: a survey

H Senior, G Slabaugh, S Yuan, L Rossi - The Visual Computer, 2024 - Springer
2D image understanding is a complex problem within computer vision, but it holds
the key to providing human-level scene comprehension. It goes further than identifying the …

OED: towards one-stage end-to-end dynamic scene graph generation

G Wang, Z Li, Q Chen, Y Liu - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
Dynamic Scene Graph Generation (DSGG) focuses on identifying visual
relationships within the spatial-temporal domain of videos. Conventional approaches often …

M3S: Scene graph driven multi-granularity multi-task learning for multi-modal NER

J Wang, Y Yang, K Liu, Z Zhu… - IEEE/ACM Transactions on …, 2022 - ieeexplore.ieee.org
Multi-modal Named Entity Recognition (MNER), which mainly focuses on enhancing text-
only NER with visual information, has recently attracted considerable attention. Most current …

VQA-GNN: Reasoning with multimodal knowledge via graph neural networks for visual question answering

Y Wang, M Yasunaga, H Ren… - Proceedings of the …, 2023 - openaccess.thecvf.com
Visual question answering (VQA) requires systems to perform concept-level reasoning by
unifying unstructured (e.g., the context in question and answer; "QA context") and structured …

Multi-level knowledge-driven feature representation and triplet loss optimization network for image–text retrieval

X Qin, L Li, F Hao, M Ge, G Pang - Information Processing & Management, 2024 - Elsevier
Image–text retrieval plays a considerable role in associating vision and language. Existing
mainstream approaches focus on fine-grained alignment while ignoring the influence of …

Multimodal event causality reasoning with scene graph enhanced interaction network

J Liu, K Wei, C Liu - Proceedings of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
Multimodal event causality reasoning aims to recognize the causal relations based on the
given events and accompanying image pairs, requiring the model to have a comprehensive …

Knowledge-embedded mutual guidance for visual reasoning

W Zheng, L Yan, L Chen, Q Li… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Visual reasoning between visual images and natural language is a long-standing challenge
in computer vision. Most of the methods aim to look for answers to questions only on the …