MViTv2: Improved multiscale vision transformers for classification and detection
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …
Uniformer: Unifying convolution and self-attention for visual recognition
It is a challenging task to learn discriminative representation from images and videos, due to
large local redundancy and complex global dependency in these visual data. Convolution …
Multiscale vision transformers
H Fan, B …
Keep your eye on the ball: Trajectory attention in video transformers
In video transformers, the time dimension is often treated in the same way as the two spatial
dimensions. However, in a scene where objects or the camera may move, a physical point …
A large-scale study on unsupervised spatiotemporal representation learning
We present a large-scale study on unsupervised spatiotemporal representation learning
from videos. With a unified perspective on four recent image-based frameworks, we study a …
A simple multi-modality transfer learning baseline for sign language translation
This paper proposes a simple transfer learning baseline for sign language translation.
Existing sign language datasets (e.g., PHOENIX-2014T, CSL-Daily) contain only about 10K …
Recurring the transformer for video action recognition
Existing video understanding approaches, such as 3D convolutional neural networks and
Transformer-based methods, usually process the videos in a clip-wise manner. Hence huge …