Multimodal machine learning: A survey and taxonomy

T Baltrušaitis, C Ahuja… - IEEE transactions on …, 2018 - ieeexplore.ieee.org
Our experience of the world is multimodal – we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …

Binary relevance for multi-label learning: an overview

ML Zhang, YK Li, XY Liu, X Geng - Frontiers of Computer Science, 2018 - Springer
Multi-label learning deals with problems where each example is represented by a single
instance while being associated with multiple class labels simultaneously. Binary relevance …
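
Binary relevance decomposes the multi-label problem into one independent binary classification task per label, ignoring label correlations. A minimal sketch of that decomposition, assuming scikit-learn's LogisticRegression as the base learner (any binary classifier could be substituted):

```python
# Binary relevance sketch: one independent binary classifier per label,
# all trained on the same feature matrix X.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_binary_relevance(X, Y, base_learner=LogisticRegression):
    """X: (n_samples, n_features); Y: (n_samples, n_labels) 0/1 matrix."""
    classifiers = []
    for j in range(Y.shape[1]):
        clf = base_learner(max_iter=1000)
        clf.fit(X, Y[:, j])          # j-th label treated as its own binary task
        classifiers.append(clf)
    return classifiers

def predict_binary_relevance(classifiers, X):
    # Stack per-label predictions back into an (n_samples, n_labels) matrix.
    return np.column_stack([clf.predict(X) for clf in classifiers])
```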

Photorealistic text-to-image diffusion models with deep language understanding

C Saharia, W Chan, S Saxena, L Li… - Advances in neural …, 2022 - proceedings.neurips.cc
We present Imagen, a text-to-image diffusion model with an unprecedented degree of
photorealism and a deep level of language understanding. Imagen builds on the power of …

Prompting visual-language models for efficient video understanding

C Ju, T Han, K Zheng, Y Zhang, W Xie - European Conference on …, 2022 - Springer
Image-based visual-language (I-VL) pre-training has shown great success for learning joint
visual-textual representations from large-scale web data, revealing remarkable ability for …
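
The snippet refers to CLIP-style image-text pre-training adapted to video. The paper itself learns continuous prompt vectors; the sketch below is only a hand-written-prompt, zero-shot baseline on a CLIP-like model, assuming the Hugging Face `openai/clip-vit-base-patch32` checkpoint and illustrative prompt text:

```python
# Zero-shot video classification sketch (not the paper's learned prompts):
# average per-frame CLIP image features and score them against text
# embeddings of hand-written class prompts.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify_video(frames, class_names):
    """frames: list of PIL.Image sampled from one video."""
    prompts = [f"a video of a person {c}" for c in class_names]
    inputs = processor(text=prompts, images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        image_feats = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_feats = model.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"])
    # Pool frame features into a single video-level embedding.
    video_feat = image_feats.mean(dim=0, keepdim=True)
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return (video_feat @ text_feats.T).softmax(dim=-1)   # class probabilities
```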

Self-supervised multimodal versatile networks

JB Alayrac, A Recasens, R Schneider… - Advances in neural …, 2020 - proceedings.neurips.cc
Videos are a rich source of multi-modal supervision. In this work, we learn representations
using self-supervision by leveraging three modalities naturally present in videos: visual …
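
A standard self-supervised objective for co-occurring modalities in video is a cross-modal contrastive (InfoNCE-style) loss, where clips from the same video are positives and other clips in the batch are negatives. A minimal PyTorch sketch of that general recipe, not the paper's exact losses or projection heads:

```python
# Symmetric cross-modal contrastive loss between two modality embeddings
# (e.g., video and audio) computed from the same batch of clips.
import torch
import torch.nn.functional as F

def cross_modal_nce(video_emb, audio_emb, temperature=0.07):
    """video_emb, audio_emb: (batch, dim) embeddings of paired clips."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.T / temperature                 # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Diagonal entries are the true pairs; match in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```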

Promptdet: Towards open-vocabulary detection using uncurated images

C Feng, Y Zhong, Z Jie, X Chu, H Ren, X Wei… - … on Computer Vision, 2022 - Springer
The goal of this work is to establish a scalable pipeline for expanding an object detector
towards novel/unseen categories, using zero manual annotations. To achieve that, we make …

XNLI: Evaluating cross-lingual sentence representations

A Conneau, G Lample, R Rinott, A Williams… - arXiv preprint arXiv …, 2018 - arxiv.org
State-of-the-art natural language processing systems rely on supervision in the form of
annotated data to learn competent models. These models are generally trained on data in a …

Variational autoencoders for collaborative filtering

D Liang, RG Krishnan, MD Hoffman… - Proceedings of the 2018 …, 2018 - dl.acm.org
We extend variational autoencoders (VAEs) to collaborative filtering for implicit feedback.
This non-linear probabilistic model enables us to go beyond the limited modeling capacity of …
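
The model described is, in spirit, a VAE whose decoder places a multinomial distribution over items given a latent user representation inferred from that user's implicit-feedback vector. A minimal PyTorch sketch with arbitrary layer sizes, not the authors' exact architecture:

```python
# VAE over a user's implicit-feedback (click) vector with a multinomial
# likelihood over items; layer sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFVAE(nn.Module):
    def __init__(self, n_items, hidden=600, latent=200):
        super().__init__()
        self.encoder = nn.Linear(n_items, hidden)
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.decoder = nn.Linear(latent, n_items)

    def forward(self, x):                      # x: (batch, n_items) 0/1 clicks
        h = torch.tanh(self.encoder(F.normalize(x, dim=-1)))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.decoder(z), mu, logvar

def loss_fn(logits, x, mu, logvar, beta=0.2):
    # Multinomial log-likelihood term plus a (down-weighted) KL term.
    recon = -(F.log_softmax(logits, dim=-1) * x).sum(dim=-1).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
    return recon + beta * kl
```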

Recent trends in deep learning based natural language processing

T Young, D Hazarika, S Poria… - IEEE Computational …, 2018 - ieeexplore.ieee.org
Deep learning methods employ multiple processing layers to learn hierarchical
representations of data, and have produced state-of-the-art results in many domains …

Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly

Y Xian, CH Lampert, B Schiele… - IEEE transactions on …, 2018 - ieeexplore.ieee.org
Due to the importance of zero-shot learning, i.e., classifying images where there is a lack of
labeled training data, the number of proposed approaches has recently increased steadily …
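
Methods of the kind surveyed here typically map image features into a semantic space (e.g., per-class attribute vectors) shared by seen and unseen classes, then assign a test image to the most compatible unseen class. A minimal compatibility-style sketch, with a ridge-regression map used as an assumed stand-in for the evaluated approaches:

```python
# Attribute-based zero-shot classification sketch: learn a linear map W from
# image features to attribute space on seen classes, then label test images
# by the nearest unseen-class attribute signature.
import numpy as np

def fit_linear_map(X_seen, A_seen, reg=1.0):
    """X_seen: (n, d) image features; A_seen: (n, k) attributes of each image's class."""
    d = X_seen.shape[1]
    # Ridge-regression solution: W = (X^T X + reg*I)^{-1} X^T A
    return np.linalg.solve(X_seen.T @ X_seen + reg * np.eye(d), X_seen.T @ A_seen)

def predict_unseen(X_test, W, A_unseen):
    """A_unseen: (c, k) attribute signature per unseen class; returns class indices."""
    scores = (X_test @ W) @ A_unseen.T          # compatibility with each unseen class
    return scores.argmax(axis=1)
```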