Token contrast for weakly-supervised semantic segmentation
Weakly-Supervised Semantic Segmentation (WSSS) using image-level labels
typically utilizes Class Activation Maps (CAMs) to generate pseudo labels. Limited by the …
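For context, a CAM is computed by weighting the final convolutional feature maps with the classifier weights of the target class. The sketch below shows this standard recipe (the function name and shapes are illustrative), not the token-contrast method proposed in the paper itself.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, classifier_weight, class_idx):
    """Standard CAM (Zhou et al., 2016): weight the final conv feature
    maps by the classifier weights of the target class.

    features:          (C, H, W) feature maps before global average pooling
    classifier_weight: (num_classes, C) weights of the final linear layer
    class_idx:         index of the target class
    """
    w = classifier_weight[class_idx]              # (C,)
    cam = torch.einsum("c,chw->hw", w, features)  # weighted sum over channels
    cam = F.relu(cam)                             # keep positive evidence only
    cam = cam / (cam.max() + 1e-8)                # normalize to [0, 1]
    return cam                                    # threshold -> pseudo mask
```

Thresholding the normalized map then yields the per-class pseudo mask that WSSS pipelines train the segmentation network on.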
Video transformers: A survey
Transformer models have shown great success in handling long-range interactions, making
them a promising tool for modeling video. However, they lack inductive biases and scale …
AdaMV-MoE: Adaptive multi-task vision mixture-of-experts
Sparsely activated Mixture-of-Experts (MoE) is becoming a promising paradigm for
multi-task learning (MTL). Instead of compressing multiple tasks' knowledge into a single …
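For reference, the sparsely activated MoE layer the abstract refers to routes each token to a small subset of experts, so only those experts run. Below is a minimal top-k routing sketch; the class name, expert shapes, and loop-based dispatch are illustrative assumptions and do not reproduce AdaMV-MoE's adaptive expert-number mechanism.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal sparse MoE layer: a linear router scores experts per
    token and only the top-k experts process each token."""

    def __init__(self, dim, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x):                      # x: (tokens, dim)
        gate = self.router(x).softmax(dim=-1)  # (tokens, num_experts)
        weight, idx = gate.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):             # dispatch tokens to their k experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weight[mask, slot, None] * expert(x[mask])
        return out
```

Because only k of the experts run per token, capacity grows with the expert count while per-token compute stays roughly constant, which is what makes MoE attractive for multi-task scaling.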
Masked relation learning for deepfake detection
DeepFake detection aims to differentiate falsified faces from real ones. Most approaches
formulate it as a binary classification problem by solely mining the local artifacts and …
PPT: Token-pruned pose transformer for monocular and multi-view human pose estimation
Recently, the vision transformer and its variants have played an increasingly important role
in both monocular and multi-view human pose estimation. Considering image patches as …
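As background, token pruning in a vision transformer typically scores patch tokens by the attention they receive and discards the low-scoring ones. The sketch below shows this generic recipe; the function name, scoring rule, and keep ratio are assumptions for illustration, not PPT's exact human-token selection criterion.

```python
import torch

def prune_tokens(tokens, attn, keep_ratio=0.7):
    """Keep only the most attended-to tokens.

    tokens: (B, N, D) patch tokens
    attn:   (B, heads, N, N) attention matrix of the current layer
    """
    # Score each token by the attention it receives, averaged over
    # heads (dim 1) and then over query positions.
    score = attn.mean(dim=1).mean(dim=1)            # (B, N)
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = score.topk(k, dim=-1).indices             # (B, k)
    idx = idx.sort(dim=-1).values                   # preserve spatial order
    batch = torch.arange(tokens.shape[0])[:, None]  # (B, 1)
    return tokens[batch, idx]                       # (B, k, D)
```

Applying this after a few early layers shrinks the token sequence for all later layers, which is where most of the speedup comes from.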
Sparse MoE as the new dropout: Scaling dense and self-slimmable transformers
Despite their remarkable achievement, gigantic transformers encounter significant
drawbacks, including exorbitant computational and memory footprints during training, as …
SHViT: Single-head vision transformer with memory-efficient macro design
Recently, efficient Vision Transformers have shown great performance with low
latency on resource-constrained devices. Conventionally, they use 4×4 patch embeddings …
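For reference, the 4×4 patch embedding mentioned above is just a strided convolution that turns each 4×4 pixel patch into one token, which yields a long token sequence at typical input resolutions. The dimensions below are illustrative, not SHViT's actual configuration.

```python
import torch
import torch.nn as nn

# A conventional 4x4 patch embedding: a conv with kernel = stride = 4
# maps each 4x4 pixel patch to one token (channel dim chosen arbitrarily).
patch_embed = nn.Conv2d(3, 64, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)             # one RGB image
tokens = patch_embed(x)                     # (1, 64, 56, 56)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 3136, 64): 56*56 tokens
```

At 224×224 input this produces 3,136 tokens per image, which illustrates why the patch-embedding choice dominates the memory footprint of early stages.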
The lighter the better: rethinking transformers in medical image segmentation through adaptive pruning
Vision transformers have recently set off a new wave in the field of medical image analysis
due to their remarkable performance on various computer vision tasks. However, recent …