Multimodal intelligence: Representation learning, information fusion, and applications

C Zhang, Z Yang, X He, L Deng - IEEE Journal of Selected …, 2020 - ieeexplore.ieee.org
Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …

Multimodal machine learning: A survey and taxonomy

T Baltrušaitis, C Ahuja… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …

Self-supervised multimodal versatile networks

JB Alayrac, A Recasens, R Schneider… - Advances in Neural …, 2020 - proceedings.neurips.cc
Videos are a rich source of multi-modal supervision. In this work, we learn representations
using self-supervision by leveraging three modalities naturally present in videos: visual …
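
As a rough illustration of this kind of cross-modal self-supervision, the Python sketch below aligns three modality embeddings with a symmetric InfoNCE loss; the batch size, embedding width, temperature, and the choice to anchor on video are all assumptions for illustration, not the paper's actual design.

    import torch
    import torch.nn.functional as F

    def info_nce(a, b, temperature=0.07):
        # Symmetric InfoNCE between two batches of embeddings; matching
        # pairs sit on the diagonal of the similarity matrix.
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / temperature
        targets = torch.arange(a.size(0))
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2

    # Placeholder tensors standing in for per-modality encoder outputs.
    B, D = 32, 256
    vid, aud, txt = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    loss = info_nce(vid, aud) + info_nce(vid, txt)  # align audio and text to video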

End-to-end dense video captioning with parallel decoding

T Wang, R Zhang, Z Lu, F Zheng… - Proceedings of the …, 2021 - openaccess.thecvf.com
Dense video captioning aims to generate multiple associated captions with their temporal
locations from the video. Previous methods follow a sophisticated "localize-then-describe" …

HERO: Hierarchical encoder for video+language omni-representation pre-training

L Li, YC Chen, Y Cheng, Z Gan, L Yu, J Liu - arXiv preprint arXiv …, 2020 - arxiv.org
We present HERO, a novel framework for large-scale video+language omni-representation
learning. HERO encodes multimodal inputs in a hierarchical structure, where local context of …

HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips

A Miech, D Zhukov, JB Alayrac… - Proceedings of the …, 2019 - openaccess.thecvf.com
Learning text-video embeddings usually requires a dataset of video clips with manually
provided captions. However, such datasets are expensive and time-consuming to create and …
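
A text-video embedding of this sort is commonly trained by mapping both modalities into a shared space under a ranking objective; the sketch below uses a max-margin ranking loss over in-batch negatives, with assumed shapes and margin, and is a generic recipe rather than the paper's exact setup.

    import torch
    import torch.nn.functional as F

    def max_margin_ranking(video_emb, text_emb, margin=0.2):
        # Matching (video, text) pairs lie on the diagonal; every
        # off-diagonal pair in the batch is treated as a negative.
        v = F.normalize(video_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        sims = v @ t.t()
        pos = sims.diag().unsqueeze(1)
        cost_t = (margin + sims - pos).clamp(min=0)      # video vs. wrong text
        cost_v = (margin + sims - pos.t()).clamp(min=0)  # text vs. wrong video
        off_diag = ~torch.eye(sims.size(0), dtype=torch.bool)
        return (cost_t[off_diag].mean() + cost_v[off_diag].mean()) / 2

    # Placeholder encoder outputs for a batch of 64 clip-caption pairs.
    loss = max_margin_ranking(torch.randn(64, 512), torch.randn(64, 512))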

ActBERT: Learning global-local video-text representations

L Zhu, Y Yang - Proceedings of the IEEE/CVF conference …, 2020 - openaccess.thecvf.com
In this paper, we introduce ActBERT for self-supervised learning of joint video-text
representations from unlabeled data. First, we leverage global action information to catalyze …

Exploring visual relationship for image captioning

T Yao, Y Pan, Y Li, T Mei - Proceedings of the European …, 2018 - openaccess.thecvf.com
It has long been believed that modeling relationships between objects would be helpful for
representing and eventually describing an image. Nevertheless, there has not been …

T2VLAD: Global-local sequence alignment for text-video retrieval

X Wang, L Zhu, Y Yang - … of the IEEE/CVF conference on …, 2021 - openaccess.thecvf.com
Text-video retrieval is a challenging task that aims to search for relevant video content based
on natural language descriptions. The key to this problem is to measure text-video …
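
Whatever alignment scheme produces the embeddings, the retrieval step itself reduces to similarity ranking in the shared space; the minimal sketch below (dimensions and candidate count assumed) illustrates that step only and does not reflect how T2VLAD computes its global-local similarities.

    import torch
    import torch.nn.functional as F

    def retrieve(query_emb, video_embs, k=5):
        # Rank all candidate videos by cosine similarity to the text query
        # and return the indices of the top k.
        q = F.normalize(query_emb, dim=-1)
        v = F.normalize(video_embs, dim=-1)
        return (v @ q).topk(k).indices

    # 1000 candidate videos in a 512-dim shared space (illustrative sizes).
    top5 = retrieve(torch.randn(512), torch.randn(1000, 512))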

Use what you have: Video retrieval using representations from collaborative experts

Y Liu, S Albanie, A Nagrani, A Zisserman - arXiv preprint arXiv …, 2019 - arxiv.org
The rapid growth of video on the internet has made searching for video content using natural
language queries a significant challenge. Human-generated queries for video datasets in the …