MViTv2: Improved multiscale vision transformers for classification and detection

Y Li, CY Wu, H Fan, K Mangalam… - Proceedings of the …, 2022 - openaccess.thecvf.com

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …

Cited by 854 · All 6 versions

[PDF] arxiv.org

UniFormer: Unifying convolution and self-attention for visual recognition

K Li, Y Wang, J Zhang, P Gao, G Song… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org

It is a challenging task to learn discriminative representation from images and videos, due to
large local redundancy and complex global dependency in these visual data. Convolution …

Cited by 411 · All 6 versions

[PDF] thecvf.com

Multiscale vision transformers

H Fan, B Xiong…

Keeping your eye on the ball: Trajectory attention in video transformers

M Patrick, D Campbell, Y Asano… - Advances in neural …, 2021 - proceedings.neurips.cc

In video transformers, the time dimension is often treated in the same way as the two spatial
dimensions. However, in a scene where objects or the camera may move, a physical point …

Cited by 289 · All 13 versions

[PDF] thecvf.com

A large-scale study on unsupervised spatiotemporal representation learning

C Feichtenhofer, H Fan, B Xiong… - Proceedings of the …, 2021 - openaccess.thecvf.com

We present a large-scale study on unsupervised spatiotemporal representation learning
from videos. With a unified perspective on four recent image-based frameworks, we study a …

Cited by 308 · All 6 versions

[PDF] thecvf.com

A simple multi-modality transfer learning baseline for sign language translation

Y Chen, F Wei, X Sun, Z Wu… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com

This paper proposes a simple transfer learning baseline for sign language translation.
Existing sign language datasets (e.g., PHOENIX-2014T, CSL-Daily) contain only about 10K …

Cited by 159 · All 5 versions

[PDF] thecvf.com

Recurring the transformer for video action recognition

J Yang, X Dong, L Liu, C Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com

Existing video understanding approaches, such as 3D convolutional neural networks and
Transformer-based methods, usually process the videos in a clip-wise manner. Hence huge …

Cited by 114 · All 4 versions

PySlowFast

MViTv2: Improved multiscale vision transformers for classification and detection