Multimodal intelligence: Representation learning, information fusion, and applications
Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …
Multimodal machine learning: A survey and taxonomy
Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …
Self-supervised multimodal versatile networks
Videos are a rich source of multi-modal supervision. In this work, we learn representations
using self-supervision by leveraging three modalities naturally present in videos: visual …
End-to-end dense video captioning with parallel decoding
Dense video captioning aims to generate multiple associated captions with their temporal
locations from the video. Previous methods follow a sophisticated "localize-then-describe" …
HERO: Hierarchical encoder for video+language omni-representation pre-training
We present HERO, a novel framework for large-scale video+language omni-representation
learning. HERO encodes multimodal inputs in a hierarchical structure, where local context of …
HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips
Learning text-video embeddings usually requires a dataset of video clips with manually
provided captions. However, such datasets are expensive and time-consuming to create and …
ActBERT: Learning global-local video-text representations
In this paper, we introduce ActBERT for self-supervised learning of joint video-text
representations from unlabeled data. First, we leverage global action information to catalyze …
Exploring visual relationship for image captioning
It is widely believed that modeling relationships between objects would be helpful for
representing and eventually describing an image. Nevertheless, there has not been …
T2VLAD: Global-local sequence alignment for text-video retrieval
Text-video retrieval is a challenging task that aims to search relevant video contents based
on natural language descriptions. The key to this problem is to measure text-video …
Use what you have: Video retrieval using representations from collaborative experts
The rapid growth of video on the internet has made searching for video content using natural
language queries a significant challenge. Human-generated queries for video datasets in the …