A comprehensive survey on applications of transformers for deep learning tasks
Transformers are Deep Neural Networks (DNNs) that utilize a self-attention
mechanism to capture contextual relationships within sequential data. Unlike traditional …
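As a generic illustration of the self-attention mechanism this abstract refers to, the sketch below computes single-head scaled dot-product attention in NumPy; it is not code from the survey, and the projection matrices w_q, w_k, w_v are illustrative placeholders.

```python
# Minimal single-head scaled dot-product self-attention sketch (illustrative only).
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) token embeddings; w_*: (d_model, d_head) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v               # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])           # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ v                                # context-weighted mix of values per token

# Toy usage: 5 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (5, 8)
```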
Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Sigmoid loss for language image pre-training
We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard
contrastive learning with softmax normalization, the sigmoid loss operates solely on image …
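To make the pairwise formulation concrete, here is a hedged NumPy sketch of a sigmoid image-text loss in the spirit of this abstract: every image-text pair receives an independent binary label, so no batch-wide softmax normalization is needed. The fixed temperature t and bias b, and the averaging over all pairs, are simplifying assumptions rather than the paper's exact recipe.

```python
# Pairwise sigmoid image-text loss sketch (simplified; constants t, b are assumptions).
import numpy as np

def pairwise_sigmoid_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """img_emb, txt_emb: (N, D) L2-normalized embeddings; row i of each is a matched pair."""
    logits = t * img_emb @ txt_emb.T + b          # (N, N) scaled pairwise similarities
    labels = 2.0 * np.eye(len(img_emb)) - 1.0     # +1 for matched pairs, -1 for all others
    # -log sigmoid(label * logit), computed stably as log(1 + exp(-label * logit))
    return np.mean(np.logaddexp(0.0, -labels * logits))

# Toy usage with random unit-normalized embeddings.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 16)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(4, 16)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(pairwise_sigmoid_loss(img, txt))
```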
Reproducible scaling laws for contrastive language-image learning
Scaling up neural networks has led to remarkable performance across a wide range of
tasks. Moreover, performance often follows reliable scaling laws as a function of training set …
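Scaling laws of this kind are commonly modeled as power laws in the amount of training data. The sketch below fits the assumed form error ~ a * N**(-b) + c to made-up (dataset size, error) points with SciPy; the numbers are purely illustrative and are not measurements from the paper.

```python
# Power-law scaling-curve fit sketch (data points are invented for illustration).
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Error as a function of training-set size n: a * n**(-b) + c."""
    return a * n ** (-b) + c

n_samples = np.array([1e6, 3e6, 1e7, 3e7, 1e8])        # hypothetical dataset sizes
error     = np.array([0.52, 0.44, 0.37, 0.33, 0.30])   # hypothetical evaluation error

params, _ = curve_fit(power_law, n_samples, error, p0=[1.0, 0.1, 0.2], maxfev=10000)
a, b, c = params
print(f"fit: error ~ {a:.3f} * N^(-{b:.3f}) + {c:.3f}")
print("extrapolated error at N = 1e9:", power_law(1e9, a, b, c))
```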
InternVideo2: Scaling foundation models for multimodal video understanding
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …
Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
CogVLM: Visual expert for pretrained language models
We introduce CogVLM, a powerful open-source visual language foundation model. Different
from the popular shallow alignment method which maps image features into the input space …
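For context, the "shallow alignment" baseline that this abstract contrasts CogVLM with can be pictured as a small learned projection that maps frozen vision features into the language model's input embedding space. The PyTorch sketch below is an assumed, simplified illustration of that baseline; the module name, dimensions, and token layout are hypothetical and do not describe CogVLM's own architecture.

```python
# Simplified "shallow alignment" baseline sketch (names and dimensions are assumptions).
import torch
import torch.nn as nn

class ShallowAligner(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)   # the only trainable bridge

    def forward(self, image_features, text_embeddings):
        # image_features: (B, num_patches, vision_dim) from a frozen vision encoder
        # text_embeddings: (B, seq_len, llm_dim) from the LLM's embedding table
        visual_tokens = self.proj(image_features)    # (B, num_patches, llm_dim)
        # Prepend the projected visual tokens so the frozen LLM reads them as a prefix.
        return torch.cat([visual_tokens, text_embeddings], dim=1)
```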
PaLI: A jointly-scaled multilingual language-image model
Effective scaling and a flexible task interface enable large language models to excel at many
tasks. We present PaLI (Pathways Language and Image model), a model that extends this …