Knowledge graphs meet multi-modal learning: A comprehensive survey
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …
From image to language: A critical analysis of visual question answering (vqa) approaches, challenges, and opportunities
The multimodal task of Visual Question Answering (VQA), encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …
Egoschema: A diagnostic benchmark for very long-form video language understanding
We introduce EgoSchema, a very long-form video question-answering dataset, and
benchmark to evaluate long video understanding capabilities of modern vision and …
A-okvqa: A benchmark for visual question answering using world knowledge
Abstract The Visual Question Answering (VQA) task aspires to provide a meaningful testbed
for the development of AI models that can jointly reason over visual and natural language …
Mmbench-video: A long-form multi-shot benchmark for holistic video understanding
The advent of large vision-language models (LVLMs) has spurred research into their
applications in multi-modal contexts, particularly in video understanding. Traditional …
Zero-shot video question answering via frozen bidirectional language models
Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of questions and answers for videos, however, is tedious …
Just ask: Learning to answer questions from millions of narrated videos
Recent methods for visual question answering rely on large-scale annotated datasets.
Manual annotation of questions and answers for videos, however, is tedious, expensive and …
Revive: Regional visual representation matters in knowledge-based visual question answering
This paper revisits visual representation in knowledge-based visual question answering
(VQA) and demonstrates that using regional information in a better way can significantly …
Video question answering: Datasets, algorithms and challenges
Video Question Answering (VideoQA) aims to answer natural language questions based on
the given videos. It has earned increasing attention with recent research trends in joint …
Avqa: A dataset for audio-visual question answering on videos
Audio-visual question answering aims to answer questions regarding both audio and visual
modalities in a given video, and has drawn increasing research interest in recent years …