Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

AutoAD II: The sequel - who, when, and what in movie audio description

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …

Coot: Cooperative hierarchical transformer for video-text representation learning

S Ging, M Zolfaghari, H Pirsiavash… - Advances in neural …, 2020 - proceedings.neurips.cc
Many real-world video-text tasks involve different levels of granularity, such as frames and
words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this …

Multimodal machine learning: A survey and taxonomy

T Baltrušaitis, C Ahuja… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …

AutoAD: Movie description in context

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …

MovieQA: Understanding stories in movies through question-answering

M Tapaswi, Y Zhu, R Stiefelhagen… - Proceedings of the …, 2016 - openaccess.thecvf.com
We introduce the MovieQA dataset, which aims to evaluate automatic story comprehension
from both video and text. The dataset consists of 14,944 questions about 408 movies with …

Aligning books and movies: Towards story-like visual explanations by watching movies and reading books

Y Zhu, R Kiros, R Zemel, R Salakhutdinov… - Proceedings of the …, 2015 - cv-foundation.org
Books are a rich source of both fine-grained information, what a character, an object or a
scene looks like, as well as high-level semantics, what someone is thinking, feeling and how …

Semantic conditioned dynamic modulation for temporal sentence grounding in videos

Y Yuan, L Ma, J Wang, W Liu… - Advances in Neural …, 2019 - proceedings.neurips.cc
Temporal sentence grounding in videos aims to detect and localize one target video
segment, which semantically corresponds to a given sentence. Existing methods mainly …

To find where you talk: Temporal sentence localization in video with attention-based location regression

Y Yuan, T Mei, W Zhu - Proceedings of the AAAI Conference on Artificial …, 2019 - aaai.org
We have witnessed tremendous growth of videos over the Internet, most of which are
paired with abundant sentence descriptions, such as video titles …