Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

Neural motifs: Scene graph parsing with global context

R Zellers, M Yatskar, S Thomson… - Proceedings of the …, 2018 - openaccess.thecvf.com
We investigate the problem of producing structured graph representations of visual scenes.
Our work analyzes the role of motifs: regularly appearing substructures in scene graphs. We …

Visual spatial reasoning

F Liu, G Emerson, N Collier - Transactions of the Association for …, 2023 - direct.mit.edu
Spatial relations are a basic part of human cognition. However, they are expressed in
natural language in a variety of ways, and previous work has suggested that current vision …

Visual commonsense R-CNN

T Wang, J Huang, H Zhang… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
We present a novel unsupervised feature representation learning method, Visual
Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an …

Modeling relationships in referential expressions with compositional modular networks

R Hu, M Rohrbach, J Andreas… - Proceedings of the …, 2017 - openaccess.thecvf.com
People often refer to entities in an image in terms of their relationships with other entities. For
example, "the black cat sitting under the table" refers to both a "black cat" entity and its …

Weakly-supervised learning of visual relations

J Peyre, J Sivic, I Laptev… - Proceedings of the IEEE …, 2017 - openaccess.thecvf.com
This paper introduces a novel approach for modeling visual relations between pairs of
objects. We define a relation as a triplet of the form (subject, predicate, object), where the predicate …
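For readers unfamiliar with this formulation, the sketch below shows one common way such (subject, predicate, object) relation triplets are represented in code. The class and field names are illustrative assumptions, not taken from the paper itself.

```python
from dataclasses import dataclass

# Minimal sketch of the (subject, predicate, object) triplet described in
# the abstract; names here are hypothetical, not from the paper's code.
@dataclass(frozen=True)
class VisualRelation:
    subject: str    # e.g. "person"
    predicate: str  # e.g. "riding"
    object: str     # e.g. "horse"

relations = [
    VisualRelation("person", "riding", "horse"),
    VisualRelation("black cat", "under", "table"),
]

for r in relations:
    print(f"({r.subject}, {r.predicate}, {r.object})")
```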

PIGLeT: Language grounding through neuro-symbolic interaction in a 3D world

R Zellers, A Holtzman, M Peters, R Mottaghi… - arXiv preprint arXiv …, 2021 - arxiv.org
We propose PIGLeT: a model that learns physical commonsense knowledge through
interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a …

Things not written in text: Exploring spatial commonsense from visual signals

X Liu, D Yin, Y Feng, D Zhao - arXiv preprint arXiv:2203.08075, 2022 - arxiv.org
Spatial commonsense, the knowledge about spatial position and relationship between
objects (like the relative size of a lion and a girl, and the position of a boy relative to a bicycle …

Envisioning narrative intelligence: A creative visual storytelling anthology

BA Halperin, SM Lukin - Proceedings of the 2023 CHI Conference on …, 2023 - dl.acm.org
In this paper, we collect an anthology of 100 visual stories from authors who participated in
our systematic creative process of improvised story-building based on image sequences …

Text2Scene: Generating compositional scenes from textual descriptions

F Tan, S Feng, V Ordonez - … of the IEEE/CVF Conference on …, 2019 - openaccess.thecvf.com
In this paper, we propose Text2Scene, a model that generates various forms of
compositional scene representations from natural language descriptions. Unlike recent …