Convolutional neural networks or vision transformers: Who will win the race for action recognitions in visual data?

O Moutik, H Sekkat, S Tigani, A Chehri, R Saadane… - Sensors, 2023 - mdpi.com
Understanding actions in videos remains a significant challenge in computer vision and has been the subject of extensive research over the past decades. Convolutional neural …

Attention bottlenecks for multimodal fusion

A Nagrani, S Yang, A Arnab, A Jansen… - Advances in neural …, 2021 - proceedings.neurips.cc
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …

Video Swin transformer

Z Liu, J Ning, Y Cao, Y Wei, Z Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure
Transformer architectures have attained top accuracy on the major video recognition …

ViViT: A video vision transformer

A Arnab, M Dehghani, G Heigold… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present pure-transformer based models for video classification, drawing upon the recent
success of such models in image classification. Our model extracts spatio-temporal tokens …
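The snippet's "extracts spatio-temporal tokens" can be pictured as a tubelet-style 3D patch embedding. The PyTorch sketch below is only an illustration under assumed settings; the class name, tubelet size (2x16x16), and embedding dimension (768) are hypothetical, not the paper's exact configuration.

import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Toy spatio-temporal tokenizer: non-overlapping 3D patches -> token sequence."""
    def __init__(self, in_ch=3, embed_dim=768, tubelet=(2, 16, 16)):
        super().__init__()
        # A 3D conv with stride equal to its kernel cuts the clip into tubelets
        # and linearly projects each one to an embedding vector.
        self.proj = nn.Conv3d(in_ch, embed_dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                 # video: (B, C, T, H, W)
        x = self.proj(video)                  # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)   # (B, num_tokens, D)

tokens = TubeletEmbedding()(torch.randn(1, 3, 16, 224, 224))
print(tokens.shape)                           # -> torch.Size([1, 1568, 768])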

CrossViT: Cross-attention multi-scale vision transformer for image classification

CFR Chen, Q Fan, R Panda - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
The recently developed vision transformer (ViT) has achieved promising results on image
classification compared to convolutional neural networks. Inspired by this, in this paper, we …

VATT: Transformers for multimodal self-supervised learning from raw video, audio and text

H Akbari, L Yuan, R Qian… - Advances in …, 2021 - proceedings.neurips.cc
We present a framework for learning multimodal representations from unlabeled data using
convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer …

Is space-time attention all you need for video understanding?

G Bertasius, H Wang, L Torresani - ICML, 2021 - proceedings.mlr.press
Training. We train our model for 15 epochs with an initial learning rate of 0.005, which is divided by 10 at epochs 11 and 14. During training, we first resize the shorter side of the …
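The training recipe in this snippet (15 epochs, base learning rate 0.005, divided by 10 at epochs 11 and 14) maps directly onto a step schedule. A minimal PyTorch sketch follows; the model, momentum value, and training loop are placeholders, not the paper's code.

import torch

model = torch.nn.Linear(768, 400)   # placeholder for the actual video model
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
# Divide the learning rate by 10 at epochs 11 and 14, over 15 epochs total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[11, 14], gamma=0.1)

for epoch in range(15):
    # ... one training pass over the dataset would go here ...
    scheduler.step()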

BEVT: BERT pretraining of video transformers

R Wang, D Chen, Z Wu, Y Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper studies the BERT pretraining of video transformers. It is a straightforward but
worth-studying extension given the recent success from BERT pretraining of image …

TDN: Temporal difference networks for efficient action recognition

L Wang, Z Tong, B Ji, G Wu - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
Temporal modeling remains challenging for action recognition in videos. To mitigate this issue, this paper presents a new video architecture, termed Temporal Difference Network …
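The "temporal difference" idea named in the title, computing differences between nearby frames as a lightweight motion cue, can be illustrated with a toy PyTorch snippet. This is only a sketch of the general technique, not the paper's actual module, and the clip shape is assumed.

import torch

def temporal_difference(clip):
    # clip: (B, T, C, H, W); returns (B, T-1, C, H, W) of frame-to-frame RGB differences
    return clip[:, 1:] - clip[:, :-1]

clip = torch.randn(2, 8, 3, 112, 112)     # assumed batch of 8-frame clips
motion_cue = temporal_difference(clip)    # short-term motion signal alongside appearance
print(motion_cue.shape)                   # -> torch.Size([2, 7, 3, 112, 112])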

MoViNets: Mobile video networks for efficient video recognition

D Kondratyuk, L Yuan, Y Li, L Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present Mobile Video Networks (MoViNets), a family of computation- and memory-efficient video networks that can operate on streaming video for online inference …