Google Scholar

Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Cited by 79

[PDF] arxiv.org

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Cited by 153

[PDF] neurips.cc

EgoSchema: A diagnostic benchmark for very long-form video language understanding

K Mangalam, R Akshulakov… - Advances in Neural …, 2023 - proceedings.neurips.cc

We introduce EgoSchema, a very long-form video question-answering dataset, and
benchmark to evaluate long video understanding capabilities of modern vision and …

Cited by 167

[PDF] thecvf.com

MovieChat: From dense token to sparse memory for long video understanding

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Recently integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

Cited by 182

[PDF] arxiv.org

A-OKVQA: A benchmark for visual question answering using world knowledge

D Schwenk, A Khandelwal, C Clark, K Marino… - European conference on …, 2022 - Springer

The Visual Question Answering (VQA) task aspires to provide a meaningful testbed
for the development of AI models that can jointly reason over visual and natural language …

Cited by 429

[PDF] neurips.cc

Zero-shot video question answering via frozen bidirectional language models

A Yang, A Miech, J Sivic, I Laptev… - Advances in Neural …, 2022 - proceedings.neurips.cc

Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of question and answers for videos, however, is tedious …

Cited by 235

[PDF] neurips.cc

MMBench-Video: A long-form multi-shot benchmark for holistic video understanding

X Fang, K Mao, H Duan, X Zhao, Y Li… - Advances in Neural …, 2025 - proceedings.neurips.cc

The advent of large vision-language models (LVLMs) has spurred research into their
applications in multi-modal contexts, particularly in video understanding. Traditional …

Cited by 29

[PDF] thecvf.com

MIST: Multi-modal iterative spatial-temporal transformer for long-form video question answering

D Gao, L Zhou, L Ji, L Zhu, Y Yang… - Proceedings of the …, 2023 - openaccess.thecvf.com

To build Video Question Answering (VideoQA) systems capable of assisting
humans in daily activities, seeking answers from long-form videos with diverse and complex …

Cited by 97

[PDF] thecvf.com

NExT-QA: Next phase of question-answering to explaining temporal actions

J Xiao, X Shang, A Yao… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com

We introduce NExT-QA, a rigorously designed video question answering (VideoQA)
benchmark to advance video understanding from describing to explaining the temporal …

Cited by 369

[PDF] arxiv.org

QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension

A Rogers, M Gardner, I Augenstein - ACM Computing Surveys, 2023 - dl.acm.org

Alongside huge volumes of research on deep learning models in NLP in the recent years,
there has been much work on benchmark datasets needed to track modeling progress …

Cited by 234

MovieQA: Understanding stories in movies through question-answering
