Knowledge graphs meet multi-modal learning: A comprehensive survey
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …
Unified-IO: A unified model for vision, language, and multi-modal tasks
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical
computer vision tasks, including pose estimation, object detection, depth estimation and …
Multi-modal knowledge graph construction and application: A survey
Recent years have witnessed the resurgence of knowledge engineering which is featured
by the fast growth of knowledge graphs. However, most existing knowledge graphs are …
CLIP-Event: Connecting text and images with event structures
Vision-language (V+L) pretraining models have achieved great success in
supporting multimedia applications by understanding the alignments between images and …
Going beyond nouns with vision & language models using synthetic data
P Cascante-Bonilla, K Shehada… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale pre-trained Vision & Language (VL) models have shown remarkable
performance in many applications, enabling the replacement of a fixed set of supported classes with …
Teaching structured vision & language concepts to vision & language models
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in
a variety of tasks. However, some aspects of complex language understanding still remain a …
Dense and aligned captions (DAC) promote compositional reasoning in VL models
Vision and Language (VL) models offer an effective method for aligning representation
spaces of images and text, allowing for numerous applications such as cross-modal retrieval …
VL-CheckList: Evaluating pre-trained vision-language models with objects, attributes and relations
Vision-Language Pretraining (VLP) models have recently successfully facilitated many cross-
modal downstream tasks. Most existing works evaluated their systems by comparing the fine …
VALSE: A task-independent benchmark for vision and language models centered on linguistic phenomena
L Parcalabescu, M Cafagna, L Muradjan… - arXiv preprint arXiv …, 2021 - arxiv.org
We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark
designed for testing general-purpose pretrained vision and language (V&L) models for their …
Learning transferable human-object interaction detector with natural language supervision
It is difficult to construct a data collection including all possible combinations of human
actions and interacting objects due to the combinatorial nature of human-object interactions …