RepViT: Revisiting mobile CNN from ViT perspective
Recently, lightweight Vision Transformers (ViTs) have demonstrated superior performance
and lower latency compared with lightweight Convolutional Neural Networks (CNNs) on …
Spike-driven transformer
Spiking Neural Networks (SNNs) provide an energy-efficient deep learning option
due to their unique spike-based event-driven (i.e., spike-driven) paradigm. In this paper, we …
MobileCLIP: Fast image-text models through multi-modal reinforced training
Contrastive pre-training of image-text foundation models such as CLIP demonstrated
excellent zero-shot performance and improved robustness on a wide range of downstream …
PEM: Prototype-based efficient MaskFormer for image segmentation
Recent transformer-based architectures have shown impressive results in the field of image
segmentation. Thanks to their flexibility, they obtain outstanding performance in multiple …
SHViT: Single-head vision transformer with memory efficient macro design
Recently, efficient Vision Transformers have shown great performance with low
latency on resource-constrained devices. Conventionally, they use 4x4 patch embeddings …
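The 4x4 patch-embedding convention mentioned in the SHViT snippet can be made concrete with a minimal NumPy sketch. This is illustrative only, not SHViT's implementation; the function name patch_embed and the random projection matrix are hypothetical stand-ins for a learned layer:

```python
import numpy as np

def patch_embed(img, patch=4, dim=64, rng=np.random.default_rng(0)):
    # img: (H, W, C). Split into non-overlapping patch x patch tiles and
    # linearly project each flattened tile to a dim-wide token (random
    # projection here stands in for the learned weights of a real layer).
    H, W, C = img.shape
    proj = rng.standard_normal((patch * patch * C, dim)) * 0.02
    tiles = img.reshape(H // patch, patch, W // patch, patch, C)
    tiles = tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return tiles @ proj  # (H/patch * W/patch, dim) token sequence

tokens = patch_embed(np.random.default_rng(1).standard_normal((32, 32, 3)))
print(tokens.shape)  # (64, 64): an 8x8 grid of 64-channel tokens
```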
Optimizing underwater image enhancement: integrating semi-supervised learning and multi-scale aggregated attention
S Xu, J Wang, N He, G Xu, G Zhang - The Visual Computer, 2024 - Springer
Underwater image enhancement is critical for advancing marine science and underwater
engineering. Traditional methods often struggle with color distortion, low contrast, and …
CAS-ViT: Convolutional additive self-attention vision transformers for efficient mobile applications
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token
mixer's powerful global context capability. However, the pairwise token affinity and complex …
HF-HRNet: a simple hardware-friendly high-resolution network
High-resolution networks have made significant progress in dense prediction tasks such as
human pose estimation and semantic segmentation. To better explore this high-resolution …
SwiftDepth: An efficient hybrid CNN-Transformer model for self-supervised monocular depth estimation on mobile devices
Self-supervised Monocular Depth Estimation (MDE) models trained solely on single-camera
video have gained significant popularity. Recent studies have shown that Vision …
Efficient Vision Transformers with Partial Attention
As the core of the Vision Transformer (ViT), self-attention has high versatility in modeling long-
range spatial interactions because every query attends to all spatial locations. Although ViT …
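The last two snippets above turn on the same mechanism: every query attends to all spatial locations, so the pairwise affinity matrix grows quadratically with the number of tokens. Below is a minimal NumPy sketch of scaled dot-product self-attention; the identity Q/K/V projections are a simplifying assumption, and no cited paper's code is reproduced here:

```python
import numpy as np

def self_attention(x):
    # x: (N, d) tokens. With identity Q/K/V projections (an assumption
    # made for brevity), every query attends to all N locations, so the
    # affinity matrix is (N, N) -- quadratic in the token count.
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                 # (N, N) pairwise affinities
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                            # (N, d) globally mixed tokens

tokens = np.random.default_rng(0).standard_normal((196, 64))  # 14x14 map, 196 tokens
print(self_attention(tokens).shape)  # (196, 64)
```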