Multimodal machine learning: A survey and taxonomy
T Baltrušaitis, C Ahuja… - IEEE transactions on …, 2018 - ieeexplore.ieee.org
Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …
A tutorial on multilabel learning
E Gibaja, S Ventura - ACM Computing Surveys (CSUR), 2015 - dl.acm.org
Multilabel learning has become a relevant learning paradigm in the past years due to the
increasing number of fields where it can be applied and also to the emerging number of …
Regionclip: Region-based language-image pretraining
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved
impressive results on image classification in both zero-shot and transfer learning settings …
AutoAD: Movie description in context
The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …
Objects that sound
In this paper our objectives are, first, networks that can embed audio and visual inputs into a
common space that is suitable for cross-modal retrieval; and second, a network that can …
Scene graph generation from objects, phrases and region captions
Object detection, scene graph generation and region captioning, which are three scene
understanding tasks at different semantic levels, are tied together: scene graphs are …
Densecap: Fully convolutional localization networks for dense captioning
We introduce the dense captioning task, which requires a computer vision system to both
localize and describe salient regions in images in natural language. The dense captioning …
Deep collaborative embedding for social image understanding
In this work, we investigate the problem of learning knowledge from the massive community-
contributed images with rich weakly-supervised context information, which can benefit …
Microsoft coco captions: Data collection and evaluation server
In this paper we describe the Microsoft COCO Caption dataset and evaluation server. When
completed, the dataset will contain over one and a half million captions describing over …
Deep visual-semantic alignments for generating image descriptions
We present a model that generates natural language descriptions of images and their
regions. Our approach leverages datasets of images and their sentence descriptions to …