Transformers in vision: A survey
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …
Human action recognition from various data modalities: A review
Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …
OccFormer: Dual-path transformer for vision-based 3D semantic occupancy prediction
The vision-based perception for autonomous driving has undergone a transformation from
the bird-eye-view (BEV) representations to the 3D semantic occupancy. Compared with the …
Visual prompt tuning
The current modus operandi in adapting pre-trained models involves updating all the
backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) …
ActionFormer: Localizing moments of actions with transformers
Self-attention based Transformer models have demonstrated impressive results for image
classification and object detection, and more recently for video understanding. Inspired by …
MViTv2: Improved multiscale vision transformers for classification and detection
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …
FLAVA: A foundational language and vision alignment model
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic
pretraining for obtaining good performance on a variety of downstream tasks. Generally …
STAR-Transformer: A spatio-temporal cross attention transformer for human action recognition
In action recognition, although the combination of spatio-temporal videos and skeleton
features can improve the recognition performance, a separate model and balancing feature …
Multiscale vision transformers
We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …
ViViT: A video vision transformer
We present pure-transformer based models for video classification, drawing upon the recent
success of such models in image classification. Our model extracts spatio-temporal tokens …