Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

MPCCT: Multimodal vision-language learning paradigm with context-based compact Transformer

C Chen, D Han, CC Chang - Pattern recognition, 2024 - Elsevier
Transformer and its variants have become the preferred option for multimodal vision-
language paradigms. However, they struggle with tasks that demand high-dependency …

A survey of visual transformers

Y Liu, Y Zhang, Y Wang, F Hou, J Yuan… - … on Neural Networks …, 2023 - ieeexplore.ieee.org
Transformer, an attention-based encoder–decoder model, has already revolutionized the
field of natural language processing (NLP). Inspired by such significant achievements, some …

Referring transformer: A one-step approach to multi-task visual grounding

M Li, L Sigal - Advances in neural information processing …, 2021 - proceedings.neurips.cc
As an important step towards visual reasoning, visual grounding (e.g., phrase localization,
referring expression comprehension/segmentation) has been widely explored. Previous …

Local-global context aware transformer for language-guided video segmentation

C Liang, W Wang, T Zhou, J Miao… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
We explore the task of language-guided video segmentation (LVS). Previous algorithms
mostly adopt 3D CNNs to learn video representation, struggling to capture long-term context …

TransVG++: End-to-end visual grounding with language conditioned vision transformer

J Deng, Z Yang, D Liu, T Chen, W Zhou… - IEEE transactions on …, 2023 - ieeexplore.ieee.org
In this work, we explore neat yet effective Transformer-based frameworks for visual
grounding. The previous methods generally address the core problem of visual grounding …

Context disentangling and prototype inheriting for robust visual grounding

W Tang, L Li, X Liu, L Jin, J Tang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Visual grounding (VG) aims to locate a specific target in an image based on a given
language query. The discriminative information from context is important for distinguishing …

Transformer-based visual grounding with cross-modality interaction

K Li, J Li, D Guo, X Yang, M Wang - ACM Transactions on Multimedia …, 2023 - dl.acm.org
This article tackles the challenging yet important task of Visual Grounding (VG), which aims
to localize a visual region in the given image referred by a natural language query. Existing …