RepViT: Revisiting mobile CNN from ViT perspective

A Wang, H Chen, Z Lin, J Han… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Recently, lightweight Vision Transformers (ViTs) have demonstrated superior performance
and lower latency compared with lightweight Convolutional Neural Networks (CNNs) on …

Spike-driven Transformer

M Yao, J Hu, Z Zhou, L Yuan, Y Tian… - Advances in neural …, 2024 - proceedings.neurips.cc
Spiking Neural Networks (SNNs) provide an energy-efficient deep learning option
due to their unique spike-based event-driven (i.e., spike-driven) paradigm. In this paper, we …
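For context on the spike-based, event-driven (spike-driven) paradigm this abstract mentions, below is a minimal sketch of a leaky integrate-and-fire (LIF) neuron, the standard spiking unit: it integrates input current and fires a binary spike only when its membrane potential crosses a threshold, so downstream computation happens only at spike events. This is generic SNN background with illustrative parameter values, not the paper's Spike-driven Transformer design.

import numpy as np

def lif_neuron(currents, tau=2.0, v_th=1.0, v_reset=0.0):
    """currents: (T,) input current per timestep; returns (T,) binary spikes."""
    v, spikes = v_reset, []
    for x in currents:
        v = v + (x - v) / tau      # leaky integration toward the input
        s = float(v >= v_th)       # emit a binary spike at threshold
        v = v_reset if s else v    # hard reset after spiking
        spikes.append(s)
    return np.array(spikes)

print(lif_neuron(np.array([0.6, 0.9, 1.2, 0.1, 1.5, 1.5])))  # [0. 0. 0. 0. 1. 0.]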

MobileCLIP: Fast image-text models through multi-modal reinforced training

PKA Vasu, H Pouransari, F Faghri… - Proceedings of the …, 2024 - openaccess.thecvf.com
Contrastive pre-training of image-text foundation models such as CLIP has demonstrated
excellent zero-shot performance and improved robustness on a wide range of downstream …
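As background on the contrastive pre-training this abstract refers to, here is a hedged numpy sketch of a CLIP-style symmetric contrastive loss over a batch of paired image/text embeddings: matched pairs lie on the diagonal of the similarity matrix and are pulled together in both directions. The function name, temperature, and toy inputs are illustrative assumptions; MobileCLIP's multi-modal reinforced training adds components not shown here.

import numpy as np

def clip_contrastive_loss(img, txt, temperature=0.07):
    """img, txt: (B, d) embeddings of matched image/text pairs."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)   # L2-normalize
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                       # (B, B) similarities
    # Cross-entropy with the diagonal (matched pairs) as targets, both directions.
    log_p_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_t2i = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -(np.mean(np.diag(log_p_i2t)) + np.mean(np.diag(log_p_t2i))) / 2

rng = np.random.default_rng(0)
loss = clip_contrastive_loss(rng.standard_normal((4, 8)),
                             rng.standard_normal((4, 8)))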

PEM: Prototype-based efficient MaskFormer for image segmentation

N Cavagnero, G Rosi, C Cuttano… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent transformer-based architectures have shown impressive results in the field of image
segmentation. Thanks to their flexibility, they obtain outstanding performance in multiple …

SHViT: Single-head vision transformer with memory-efficient macro design

S Yun, Y Ro - Proceedings of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Recently, efficient Vision Transformers have shown great performance with low
latency on resource-constrained devices. Conventionally, they use 4×4 patch embeddings …
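To illustrate the conventional 4×4 patch embedding this abstract mentions, the sketch below splits an image into non-overlapping 4×4 patches and linearly projects each patch to a token vector. Names, shapes, and the random projection are assumptions for illustration, not SHViT's actual macro design.

import numpy as np

def patch_embed(img, w, p=4):
    """img: (H, W, C); w: (p*p*C, d) projection; returns (H//p * W//p, d) tokens."""
    h, wd, c = img.shape
    patches = (img.reshape(h // p, p, wd // p, p, c)
                  .transpose(0, 2, 1, 3, 4)       # group pixels by patch
                  .reshape(-1, p * p * c))        # one row per 4x4 patch
    return patches @ w                            # linear projection to tokens

rng = np.random.default_rng(0)
tokens = patch_embed(rng.standard_normal((32, 32, 3)),
                     rng.standard_normal((4 * 4 * 3, 64)))
print(tokens.shape)  # (64, 64): an 8x8 token grid, 64 dims per token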

Optimizing underwater image enhancement: integrating semi-supervised learning and multi-scale aggregated attention

S Xu, J Wang, N He, G Xu, G Zhang - The Visual Computer, 2024 - Springer
Underwater image enhancement is critical for advancing marine science and underwater
engineering. Traditional methods often struggle with color distortion, low contrast, and …

CAS-ViT: Convolutional additive self-attention vision transformers for efficient mobile applications

T Zhang, L Li, Y Zhou, W Liu, C Qian, X Ji - arXiv preprint arXiv …, 2024 - arxiv.org
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token
mixer's powerful global context capability. However, the pairwise token affinity and complex …

HF-HRNet: a simple hardware-friendly high-resolution network

H Zhang, Y Dun, Y Pei, S Lai, C Liu… - … on Circuits and …, 2024 - ieeexplore.ieee.org
High-resolution networks have made significant progress in dense prediction tasks such as
human pose estimation and semantic segmentation. To better explore this high-resolution …

SwiftDepth: An efficient hybrid CNN-Transformer model for self-supervised monocular depth estimation on mobile devices

A Luginov, I Makarov - 2023 IEEE International Symposium on …, 2023 - ieeexplore.ieee.org
Self-supervised Monocular Depth Estimation (MDE) models trained solely on single-camera
video have gained significant popularity. Recent studies have shown that Vision …

Efficient Vision Transformers with Partial Attention

XT Vo, DL Nguyen, A Priadana, KH Jo - European Conference on …, 2024 - Springer
As the core of the Vision Transformer (ViT), self-attention is highly versatile in modeling
long-range spatial interactions because every query attends to all spatial locations. Although ViT …
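As a reference point for the full self-attention described above, where every query attends to all spatial locations and the pairwise affinities form an N×N matrix, here is a minimal numpy sketch. All names and shapes are illustrative; the paper's partial-attention variant is not reproduced here.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """x: (N, d) tokens; wq/wk/wv: (d, d) projections; returns (N, d)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (N, N) query-key affinities
    return softmax(scores) @ v                # every query mixes all values

rng = np.random.default_rng(0)
n, d = 16, 32
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)           # (16, 32)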