Knowledge graphs meet multi-modal learning: A comprehensive survey
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …
Neural motifs: Scene graph parsing with global context
We investigate the problem of producing structured graph representations of visual scenes.
Our work analyzes the role of motifs: regularly appearing substructures in scene graphs. We …
Visual spatial reasoning
Spatial relations are a basic part of human cognition. However, they are expressed in
natural language in a variety of ways, and previous work has suggested that current vision …
Visual Commonsense R-CNN
We present a novel unsupervised feature representation learning method, Visual
Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an …
Modeling relationships in referential expressions with compositional modular networks
People often refer to entities in an image in terms of their relationships with other entities. For
example, "the black cat sitting under the table" refers to both a "black cat" entity and its …
Weakly-supervised learning of visual relations
This paper introduces a novel approach for modeling visual relations between pairs of
objects. We call relation a triplet of the form (subject, predicate, object) where the predicate …
PIGLeT: Language grounding through neuro-symbolic interaction in a 3D world
We propose PIGLeT: a model that learns physical commonsense knowledge through
interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a …
Things not written in text: Exploring spatial commonsense from visual signals
Spatial commonsense, the knowledge about spatial position and relationship between
objects (like the relative size of a lion and a girl, and the position of a boy relative to a bicycle …
Envisioning narrative intelligence: A creative visual storytelling anthology
In this paper, we collect an anthology of 100 visual stories from authors who participated in
our systematic creative process of improvised story-building based on image sequences …
Text2Scene: Generating compositional scenes from textual descriptions
In this paper, we propose Text2Scene, a model that generates various forms of
compositional scene representations from natural language descriptions. Unlike recent …