MPCCT: Multimodal vision-language learning paradigm with context-based compact Transformer

C Chen, D Han, CC Chang - Pattern recognition, 2024 - Elsevier
Transformer and its variants have become the preferred option for multimodal vision-
language paradigms. However, they struggle with tasks that demand high-dependency …

RSVG: Exploring data and models for visual grounding on remote sensing data

Y Zhan, Z Xiong, Y Yuan - IEEE Transactions on Geoscience …, 2023 - ieeexplore.ieee.org
In this article, we introduce the task of visual grounding for remote sensing data (RSVG).
RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance …

Language adaptive weight generation for multi-task visual grounding

W Su, P Miao, H Dou, G Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Despite the impressive performance in visual grounding, the prevailing approaches usually
exploit the visual backbone in a passive way, i.e., the visual backbone extracts features with …

X-Mesh: Towards fast and accurate text-driven 3D stylization via dynamic textual guidance

Y Ma, X Zhang, X Sun, J Ji, H Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Text-driven 3D stylization is a complex and crucial task in the fields of computer vision (CV)
and computer graphics (CG), aimed at transforming a bare mesh to fit a target text. Prior …

TransVG++: End-to-end visual grounding with language conditioned vision transformer

J Deng, Z Yang, D Liu, T Chen, W Zhou… - IEEE transactions on …, 2023 - ieeexplore.ieee.org
In this work, we explore neat yet effective Transformer-based frameworks for visual
grounding. The previous methods generally address the core problem of visual grounding …

Grounded multimodal named entity recognition on social media

J Yu, Z Li, J Wang, R Xia - … of the 61st Annual Meeting of the …, 2023 - aclanthology.org
In recent years, Multimodal Named Entity Recognition (MNER) on social media has
attracted considerable attention. However, existing MNER studies only extract entity-type …

LGR-NET: Language guided reasoning network for referring expression comprehension

M Lu, R Li, F Feng, Z Ma, X Wang - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Referring Expression Comprehension (REC) is a fundamental task in the vision and
language domain, which aims to locate an image region according to a natural language …

ScanFormer: Referring expression comprehension by iteratively scanning

W Su, P Miao, H Dou, X Li - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Referring Expression Comprehension (REC) aims to localize the target objects
specified by free-form natural language descriptions in images. While state-of-the-art …

Unifying visual and vision-language tracking via contrastive learning

Y Ma, Y Tang, W Yang, T Zhang, J Zhang… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Single object tracking aims to locate the target object in a video sequence according to the
state specified by different modal references, including the initial bounding box (BBOX) …

Language-guided progressive attention for visual grounding in remote sensing images

K Li, D Wang, H Xu, H Zhong… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Visual grounding in remote sensing (RSVG) images aims to detect specific objects
associated with referring expressions in remote sensing images. Existing methods typically …