Multimodal machine learning: A survey and taxonomy

T Baltrušaitis, C Ahuja… - IEEE transactions on …, 2018 - ieeexplore.ieee.org
Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …

A tutorial on multilabel learning

E Gibaja, S Ventura - ACM Computing Surveys (CSUR), 2015 - dl.acm.org
Multilabel learning has become a relevant learning paradigm in the past years due to the
increasing number of fields where it can be applied and also to the emerging number of …

RegionCLIP: Region-based language-image pretraining

Y Zhong, J Yang, P Zhang, C Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved
impressive results on image classification in both zero-shot and transfer learning settings …

AutoAD: Movie description in context

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …

Objects that sound

R Arandjelovic, A Zisserman - Proceedings of the European …, 2018 - openaccess.thecvf.com
In this paper our objectives are, first, networks that can embed audio and visual inputs into a
common space that is suitable for cross-modal retrieval; and second, a network that can …

Scene graph generation from objects, phrases and region captions

Y Li, W Ouyang, B Zhou, K Wang… - Proceedings of the …, 2017 - openaccess.thecvf.com
Object detection, scene graph generation and region captioning, which are three scene
understanding tasks at different semantic levels, are tied together: scene graphs are …

DenseCap: Fully convolutional localization networks for dense captioning

J Johnson, A Karpathy… - Proceedings of the IEEE …, 2016 - openaccess.thecvf.com
We introduce the dense captioning task, which requires a computer vision system to both
localize and describe salient regions in images in natural language. The dense captioning …

Deep collaborative embedding for social image understanding

Z Li, J Tang, T Mei - IEEE transactions on pattern analysis and …, 2018 - ieeexplore.ieee.org
In this work, we investigate the problem of learning knowledge from the massive community-
contributed images with rich weakly-supervised context information, which can benefit …

Microsoft COCO captions: Data collection and evaluation server

X Chen, H Fang, TY Lin, R Vedantam, S Gupta… - arXiv preprint arXiv …, 2015 - arxiv.org
In this paper we describe the Microsoft COCO Caption dataset and evaluation server. When
completed, the dataset will contain over one and a half million captions describing over …

Deep visual-semantic alignments for generating image descriptions

A Karpathy, L Fei-Fei - … of the IEEE conference on computer …, 2015 - openaccess.thecvf.com
We present a model that generates natural language descriptions of images and their
regions. Our approach leverages datasets of images and their sentence descriptions to …