- Academic Search

Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org

Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

Cited by 2973

MPCCT: Multimodal vision-language learning paradigm with context-based compact Transformer

C Chen, D Han, CC Chang - Pattern recognition, 2024 - Elsevier

Transformer and its variants have become the preferred option for multimodal
vision-language paradigms. However, they struggle with tasks that demand high-dependency …

Cited by 101

A survey of visual transformers

Y Liu, Y Zhang, Y Wang, F Hou, J Yuan… - … on Neural Networks …, 2023 - ieeexplore.ieee.org

Transformer, an attention-based encoder–decoder model, has already revolutionized the
field of natural language processing (NLP). Inspired by such significant achievements, some …

Cited by 342

RSVG: Exploring data and models for visual grounding on remote sensing data

Y Zhan, Z …

… machines better understand …

Cited by 90

Referring transformer: A one-step approach to multi-task visual grounding

M Li, L Sigal - Advances in neural information processing …, 2021 - proceedings.neurips.cc

As an important step towards visual reasoning, visual grounding (e.g., phrase localization,
referring expression comprehension/segmentation) has been widely explored. Previous …

Cited by 172

Local-global context aware transformer for language-guided video segmentation

C Liang, W Wang, T Zhou, J Miao… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

We explore the task of language-guided video segmentation (LVS). Previous algorithms
mostly adopt 3D CNNs to learn video representation, struggling to capture long-term context …

Cited by 91

TransVG++: End-to-end visual grounding with language conditioned vision transformer

J Deng, Z Yang, D Liu, T Chen, W Zhou… - IEEE transactions on …, 2023 - ieeexplore.ieee.org

In this work, we explore neat yet effective Transformer-based frameworks for visual
grounding. The previous methods generally address the core problem of visual grounding …

Cited by 59

Context disentangling and prototype inheriting for robust visual grounding

W Tang, L Li, X Liu, L …, J Tang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Visual grounding (VG) aims to locate a specific target in an image based on a given
language query. The discriminative information from context is important for distinguishing …

Cited by 23

Transformer-based visual grounding with cross-modality interaction

K Li, J Li, D Guo, X Yang, M Wang - ACM Transactions on Multimedia …, 2023 - dl.acm.org

This article tackles the challenging yet important task of Visual Grounding (VG), which aims
to localize a visual region in the given image referred by a natural language query. Existing …

Cited by 41
