Foundations & trends in multimodal machine learning: Principles, challenges, and open questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
AutoAD II: The sequel – who, when, and what in movie audio description
Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …
COOT: Cooperative hierarchical transformer for video-text representation learning
Many real-world video-text tasks involve different levels of granularity, such as frames and
words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this …
Multimodal machine learning: A survey and taxonomy
Our experience of the world is multimodal – we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …
AutoAD: Movie description in context
The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …
MovieQA: Understanding stories in movies through question-answering
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension
from both video and text. The dataset consists of 14,944 questions about 408 movies with …
Aligning books and movies: Towards story-like visual explanations by watching movies and reading books
Books are a rich source of both fine-grained information, such as what a character, an object or a
scene looks like, as well as high-level semantics, what someone is thinking, feeling and how …
Semantic conditioned dynamic modulation for temporal sentence grounding in videos
Temporal sentence grounding in videos aims to detect and localize one target video
segment, which semantically corresponds to a given sentence. Existing methods mainly …
To find where you talk: Temporal sentence localization in video with attention based location regression
We have witnessed the tremendous growth of videos over the Internet, where most of these
videos are typically paired with abundant sentence descriptions, such as video titles …