Foundations & trends in multimodal machine learning: Principles, challenges, and open questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
EgoSchema: A diagnostic benchmark for very long-form video language understanding
We introduce EgoSchema, a very long-form video question-answering dataset, and
benchmark to evaluate long video understanding capabilities of modern vision and …
MovieChat: From dense token to sparse memory for long video understanding
Recently, integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …
A-OKVQA: A benchmark for visual question answering using world knowledge
The Visual Question Answering (VQA) task aspires to provide a meaningful testbed
for the development of AI models that can jointly reason over visual and natural language …
Zero-shot video question answering via frozen bidirectional language models
Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of questions and answers for videos, however, is tedious …
MMBench-Video: A long-form multi-shot benchmark for holistic video understanding
The advent of large vision-language models (LVLMs) has spurred research into their
applications in multi-modal contexts, particularly in video understanding. Traditional …
MIST: Multi-modal iterative spatial-temporal transformer for long-form video question answering
To build Video Question Answering (VideoQA) systems capable of assisting
humans in daily activities, seeking answers from long-form videos with diverse and complex …
NExT-QA: Next phase of question-answering to explaining temporal actions
We introduce NExT-QA, a rigorously designed video question answering (VideoQA)
benchmark to advance video understanding from describing to explaining the temporal …
QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension
Alongside huge volumes of research on deep learning models in NLP in recent years,
there has been much work on benchmark datasets needed to track modeling progress …