GSVA: Generalized segmentation via multimodal large language models

Z Xia, D Han, Y Han, X Pan, S Song… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Generalized Referring Expression Segmentation (GRES) extends the scope of
classic RES to refer to multiple objects in one expression or identify the empty targets absent …

Efficient diffusion transformer with step-wise dynamic attention mediators

Y Pu, Z Xia, J Guo, D Han, Q Li, D Li, Y Yuan… - … on Computer Vision, 2024 - Springer
This paper identifies significant redundancy in the query-key interactions within self-attention
mechanisms of diffusion transformer models, particularly during the early stages of …

Mosaic: in-memory computing and routing for small-world spike-based neuromorphic systems

T Dalgaty, F Moro, Y Demirağ, A De Pra… - Nature …, 2024 - nature.com
The brain's connectivity is locally dense and globally sparse, forming a small-world graph—
a principle prevalent in the evolution of various species, suggesting a universal solution for …

CT-Net: Asymmetric compound branch transformer for medical image segmentation

N Zhang, L Yu, D Zhang, W Wu, S Tian, X Kang, M Li - Neural Networks, 2024 - Elsevier
The Transformer architecture has been widely applied in the field of image segmentation
due to its powerful ability to capture long-range dependencies. However, its ability to capture …

LookupViT: Compressing visual information to a limited number of tokens

R Koner, G Jain, P Jain, V Tresp, S Paul - European Conference on …, 2024 - Springer
Abstract Vision Transformers (ViT) have emerged as the de-facto choice for numerous
industry grade vision solutions. But their inference cost can be prohibitive for many settings …

DAT++: Spatially dynamic vision transformer with deformable attention

Z Xia, X Pan, S Song, LE Li, G Huang - arXiv preprint arXiv:2309.01430, 2023 - arxiv.org
Transformers have shown superior performance on various vision tasks. Their large
receptive field endows Transformer models with higher representation power than their CNN …

Efficient Vision Transformers with Partial Attention

XT Vo, DL Nguyen, A Priadana, KH Jo - European Conference on …, 2024 - Springer
As a core of Vision Transformer (ViT), self-attention has high versatility in modeling long-
range spatial interactions because every query attends to all spatial locations. Although ViT …

TransXNet: learning both global and local dynamics with a dual dynamic token mixer for visual recognition

M Lou, HY Zhou, S Yang, Y Yu - arXiv preprint arXiv:2310.19380, 2023 - arxiv.org
Recent studies have integrated convolution into transformers to introduce inductive bias and
improve generalization performance. However, the static nature of conventional convolution …

ViT-MVT: A Unified Vision Transformer Network for Multiple Vision Tasks

T Xie, K Dai, Z Jiang, R Li, S Mao… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
In this work, we seek to learn multiple mainstream vision tasks concurrently using a unified
network, which is storage-efficient as numerous networks with task-shared parameters can …

MG-ViT: a multi-granularity method for compact and efficient vision transformers

Y Zhang, Y Liu, D Miao, Q Zhang… - Advances in Neural …, 2024 - proceedings.neurips.cc
Abstract Vision Transformer (ViT) faces obstacles in wide application due to its huge
computational cost. Almost all existing studies on compressing ViT adopt the manner of …