Google Scholar

Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org

Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …‏

Cited by 68; all 2 versions

[PDF] thecvf.com

OneLLM: One framework to align all modalities with language

J Han, K Gong, Y Zhang, J Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However, existing works rely heavily on modality …

Cited by 102; all 6 versions

[PDF] arxiv.org

LAVIS: A library for language-vision intelligence

D Li, J Li, H Le, G Wang, S Savarese… - arXiv preprint arXiv …, 2022 - arxiv.org

We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research
and applications. LAVIS aims to serve as a one-stop comprehensive library that brings …‏

Cited by 132; all 4 versions

[PDF] thecvf.com

VideoLLM-online: Online video large language model for streaming video

J Chen, Z Lv, S Wu, KQ Lin, C Song… - Proceedings of the …, 2024 - openaccess.thecvf.com

Large Language Models (LLMs) have been enhanced with vision capabilities,
enabling them to comprehend images, videos, and interleaved vision-language content …

Cited by 26; all 8 versions

[PDF] thecvf.com

Learning to answer questions in dynamic audio-visual scenarios‏

G Li, Y Wei, Y Tian, C Xu, JR Wen… - Proceedings of the …, 2022‏ - openaccess.thecvf.com‏

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to
answer questions regarding different visual objects, sounds, and their associations in …‏

Cited by 135; all 8 versions

[HTML] sciencedirect.com

[HTML] Learning towards conversational AI: A survey

T Fu, S Gao, X Zhao, J Wen, R Yan - AI Open, 2022‏ - Elsevier‏

Recent years have witnessed a surge of interest in the field of open-domain dialogue.
Thanks to the rapid development of social media, large dialogue corpus from the Internet …‏

Cited by 40; all 2 versions

[PDF] mlr.press

Modality competition: What makes joint training of multi-modal network fail in deep learning? (provably)

Y Huang, J Lin, C Zhou, H Yang… - … conference on machine …, 2022‏ - proceedings.mlr.press‏

Despite the remarkable success of deep multi-modal learning in practice, it has not been
well-explained in theory. Recently, it has been observed that the best uni-modal network …‏

Cited by 103; all 5 versions

[PDF] arxiv.org

Macaw-LLM: Multi-modal language modeling with image, audio, video, and text integration

C Lyu, M Wu, L Wang, X Huang, B Liu, Z Du… - arXiv preprint arXiv …, 2023 - arxiv.org

Although instruction-tuned large language models (LLMs) have exhibited remarkable
capabilities across various NLP tasks, their effectiveness on other data modalities beyond …‏

Cited by 87; all 4 versions

[PDF] thecvf.com

What makes training multi-modal classification networks hard?‏

W Wang, D Tran, M Feiszli - … of the IEEE/CVF conference on …, 2020‏ - openaccess.thecvf.com‏

Consider end-to-end training of a multi-modal vs. a uni-modal network on a task with
multiple input modalities: the multi-modal network receives more information, so it should …‏

Cited by 418; all 8 versions

[PDF] arxiv.org

PLATO: Pre-trained dialogue generation model with discrete latent variable‏

S Bao, H He, F Wang, H Wu, H Wang - arXiv preprint arXiv:1910.07931, 2019 - arxiv.org

Pre-training models have been proved effective for a wide range of natural language
processing tasks. Inspired by this, we propose a novel dialogue generation pre-training …‏

Cited by 282; all 4 versions
