Google Scholar

Pre-trained language models for text generation: A survey

J Li, T Tang, WX Zhao, JY Nie, JR Wen - ACM Computing Surveys, 2024 - dl.acm.org

Text Generation aims to produce plausible and readable text in human language from input
data. The resurgence of deep learning has greatly advanced this field, in particular, with the …‏

Cited by 428 · All 12 versions

[PDF] springer.com

Video description: A comprehensive survey of deep learning approaches‏

G Rafiq, M Rafiq, GS Choi - Artificial Intelligence Review, 2023‏ - Springer‏

Video description refers to understanding visual content and transforming that acquired
understanding into automatic textual narration. It bridges the key AI fields of computer vision …‏

Cited by 29 · All 6 versions

[PDF] thecvf.com

Panda-70M: Captioning 70M videos with multiple cross-modality teachers

TS Chen, A Siarohin, W Menapace… - Proceedings of the …, 2024‏ - openaccess.thecvf.com‏

The quality of the data and annotation upper-bounds the quality of a downstream model.
While there exist large text corpora and image-text pairs, high-quality video-text data is much …

Cited by 141 · All 8 versions

[PDF] thecvf.com

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023‏ - openaccess.thecvf.com‏

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …‏

Cited by 240 · All 19 versions

[PDF] thecvf.com

TimeChat: A time-sensitive multimodal large language model for long video understanding

S Ren, L Yao, S Li, X Sun… - Proceedings of the IEEE …, 2024‏ - openaccess.thecvf.com‏

This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …‏

Cited by 134 · All 7 versions

[PDF] ieee.org

Multimodal learning with transformers: A survey‏

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023‏ - ieeexplore.ieee.org‏

Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …‏

Cited by 646 · All 11 versions

[PDF] arxiv.org

PaLI-X: On scaling up a multilingual vision and language model

X Chen, J Djolonga, P Padlewski, B Mustafa… - arXiv preprint arXiv …, 2023 - arxiv.org

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and
language model, both in terms of size of the components and the breadth of its training task …‏

Cited by 173 · All 3 versions

[PDF] thecvf.com

Streaming dense video captioning‏

X Zhou, A Arnab, S Buch, S Yan… - Proceedings of the …, 2024‏ - openaccess.thecvf.com‏

An ideal model for dense video captioning--predicting captions localized temporally in a
video--should be able to handle long input videos, predict rich, detailed textual descriptions …

Cited by 29 · All 7 versions

[PDF] ai-data-base.com

[PDF] Learning interactive real-world simulators

M Yang, Y Du, K Ghasemipour… - arXiv preprint arXiv …, 2023 - ai-data-base.com

Generative models trained on internet data have revolutionized how text, image, and video
content can be created. Perhaps the next milestone for generative models is to simulate …‏

Cited by 109 · All 5 versions

[PDF] arxiv.org

REFLECT: Summarizing robot experiences for failure explanation and correction

Z Liu, A Bahety, S Song - arXiv preprint arXiv:2306.15724, 2023 - arxiv.org

The ability to detect and analyze failed executions automatically is crucial for an explainable
and robust robotic system. Recently, Large Language Models (LLMs) have demonstrated …‏

Cited by 107 · All 6 versions

End-to-end dense video captioning with parallel decoding

Pre-trained language models for text generation: A survey‏
