Multimodal machine learning: A survey and taxonomy
Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …
Binary relevance for multi-label learning: an overview
Multi-label learning deals with problems where each example is represented by a single
instance while being associated with multiple class labels simultaneously. Binary relevance …
Photorealistic text-to-image diffusion models with deep language understanding
We present Imagen, a text-to-image diffusion model with an unprecedented degree of
photorealism and a deep level of language understanding. Imagen builds on the power of …
Prompting visual-language models for efficient video understanding
Image-based visual-language (I-VL) pre-training has shown great success for learning joint
visual-textual representations from large-scale web data, revealing remarkable ability for …
Self-supervised multimodal versatile networks
Videos are a rich source of multi-modal supervision. In this work, we learn representations
using self-supervision by leveraging three modalities naturally present in videos: visual …
Promptdet: Towards open-vocabulary detection using uncurated images
The goal of this work is to establish a scalable pipeline for expanding an object detector
towards novel/unseen categories, using zero manual annotations. To achieve that, we make …
XNLI: Evaluating cross-lingual sentence representations
State-of-the-art natural language processing systems rely on supervision in the form of
annotated data to learn competent models. These models are generally trained on data in a …
Variational autoencoders for collaborative filtering
We extend variational autoencoders (VAEs) to collaborative filtering for implicit feedback.
This non-linear probabilistic model enables us to go beyond the limited modeling capacity of …
Recent trends in deep learning based natural language processing
Deep learning methods employ multiple processing layers to learn hierarchical
representations of data, and have produced state-of-the-art results in many domains …
Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly
Due to the importance of zero-shot learning, i.e., classifying images where there is a lack of
labeled training data, the number of proposed approaches has recently increased steadily …