Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

Unified-IO: A unified model for vision, language, and multi-modal tasks

J Lu, C Clark, R Zellers, R Mottaghi… - The Eleventh …, 2022 - openreview.net
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical
computer vision tasks, including pose estimation, object detection, depth estimation and …

Multi-modal knowledge graph construction and application: A survey

X Zhu, Z Li, X Wang, X Jiang, P Sun… - … on Knowledge and …, 2022 - ieeexplore.ieee.org
Recent years have witnessed the resurgence of knowledge engineering which is featured
by the fast growth of knowledge graphs. However, most existing knowledge graphs are …

CLIP-Event: Connecting text and images with event structures

M Li, R Xu, S Wang, L Zhou, X Lin… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision-language (V+L) pretraining models have achieved great success in
supporting multimedia applications by understanding the alignments between images and …

Going beyond nouns with vision & language models using synthetic data

P Cascante-Bonilla, K Shehada… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale pre-trained Vision & Language (VL) models have shown remarkable
performance in many applications, enabling the replacement of a fixed set of supported classes with …

Teaching structured vision & language concepts to vision & language models

S Doveh, A Arbelle, S Harary… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in
a variety of tasks. However, some aspects of complex language understanding still remain a …

Dense and aligned captions (DAC) promote compositional reasoning in VL models

S Doveh, A Arbelle, S Harary… - Advances in …, 2023 - proceedings.neurips.cc
Vision and Language (VL) models offer an effective method for aligning representation
spaces of images and text allowing for numerous applications such as cross-modal retrieval …

VL-CheckList: Evaluating pre-trained vision-language models with objects, attributes and relations

T Zhao, T Zhang, M Zhu, H Shen, K Lee, X Lu… - arXiv preprint arXiv …, 2022 - arxiv.org
Vision-Language Pretraining (VLP) models have recently successfully facilitated many cross-
modal downstream tasks. Most existing works evaluated their systems by comparing the fine …

VALSE: A task-independent benchmark for vision and language models centered on linguistic phenomena

L Parcalabescu, M Cafagna, L Muradjan… - arXiv preprint arXiv …, 2021 - arxiv.org
We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark
designed for testing general-purpose pretrained vision and language (V&L) models for their …

Learning transferable human-object interaction detector with natural language supervision

S Wang, Y Duan, H Ding, YP Tan… - Proceedings of the …, 2022 - openaccess.thecvf.com
It is difficult to construct a data collection including all possible combinations of human
actions and interacting objects due to the combinatorial nature of human-object interactions …