Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

Seqtrack: Sequence to sequence learning for visual object tracking

X Chen, H Peng, D Wang, H Lu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
In this paper, we present a new sequence-to-sequence learning framework for visual
tracking, dubbed SeqTrack. It casts visual tracking as a sequence generation problem …
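The sequence-generation framing is easy to illustrate. The sketch below (PyTorch, not the authors' code) shows the core trick of casting a box as discrete tokens: normalized (x, y, w, h) coordinates are quantized into a fixed vocabulary of bins so an autoregressive decoder can emit them one token at a time. The bin count of 1000 is a hypothetical choice, not a value taken from the paper.

```python
import torch

NUM_BINS = 1000  # assumed vocabulary size for coordinate bins

def box_to_tokens(box: torch.Tensor) -> torch.Tensor:
    """Quantize normalized (x, y, w, h) coords into integer tokens."""
    return (box.clamp(0, 1) * (NUM_BINS - 1)).round().long()

def tokens_to_box(tokens: torch.Tensor) -> torch.Tensor:
    """Invert the quantization back to continuous coordinates."""
    return tokens.float() / (NUM_BINS - 1)

box = torch.tensor([0.25, 0.40, 0.30, 0.20])
tokens = box_to_tokens(box)        # tensor([250, 400, 300, 200])
recovered = tokens_to_box(tokens)  # close to the original box
print(tokens, recovered)
```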

Universal instance perception as object discovery and retrieval

B Yan, Y Jiang, J Wu, D Wang, P Luo… - Proceedings of the …, 2023 - openaccess.thecvf.com
All instance perception tasks aim at finding certain objects specified by some queries such
as category names, language expressions, and target annotations, but this complete field …
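The retrieval formulation the abstract hints at can be sketched in a few lines: a detector yields N instance embeddings, the query (a category name, expression, or annotation) is encoded into the same space, and matching instances are those with high similarity. Everything below (dimensions, the 0.5 threshold) is an illustrative assumption, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

num_instances, dim = 5, 256  # assumed sizes
instance_emb = F.normalize(torch.randn(num_instances, dim), dim=-1)
query_emb = F.normalize(torch.randn(dim), dim=-1)

scores = instance_emb @ query_emb               # cosine similarity per instance
retrieved = (scores > 0.5).nonzero().flatten()  # hypothetical threshold
print(scores, retrieved)
```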

Vary: Scaling up the vision vocabulary for large vision-language model

H Wei, L Kong, J Chen, L Zhao, Z Ge, J Yang… - … on Computer Vision, 2024 - Springer
Most Large Vision-Language Models (LVLMs) enjoy the same vision vocabulary, i.e.,
CLIP, for common vision tasks. However, for some special tasks that need dense and fine …
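A minimal sketch of what "scaling up the vision vocabulary" can look like, under the assumption (mine, not necessarily Vary's exact design) that a second task-specific encoder is added alongside the CLIP encoder: both feature streams are projected into the language model's embedding space and concatenated as extra input tokens. All sizes below are made up.

```python
import torch
import torch.nn as nn

class TwoVocabVisionInput(nn.Module):
    """Project two vision encoders' features into one LLM token stream."""
    def __init__(self, clip_dim=1024, new_dim=768, lm_dim=4096):
        super().__init__()
        self.clip_proj = nn.Linear(clip_dim, lm_dim)  # original vocabulary
        self.new_proj = nn.Linear(new_dim, lm_dim)    # added vocabulary

    def forward(self, clip_feats, new_feats):
        # clip_feats: (B, N1, clip_dim); new_feats: (B, N2, new_dim)
        return torch.cat(
            [self.clip_proj(clip_feats), self.new_proj(new_feats)], dim=1
        )  # (B, N1 + N2, lm_dim), fed to the LLM as visual tokens

module = TwoVocabVisionInput()
out = module(torch.randn(1, 256, 1024), torch.randn(1, 256, 768))
print(out.shape)  # torch.Size([1, 512, 4096])
```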

A survey of visual transformers

Y Liu, Y Zhang, Y Wang, F Hou, J Yuan… - … on Neural Networks …, 2023 - ieeexplore.ieee.org
Transformer, an attention-based encoder–decoder model, has already revolutionized the
field of natural language processing (NLP). Inspired by such significant achievements, some …

Ferret: Refer and ground anything anywhere at any granularity

H You, H Zhang, Z Gan, X Du, B Zhang, Z Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of
understanding spatial referring of any shape or granularity within an image and accurately …
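One way to make "any shape or granularity" concrete: pool the visual features that fall inside an arbitrary region mask into a single continuous region token, which can then stand in for the region within the text prompt. This is a rough sketch of the idea, not Ferret's actual representation; mean pooling and all sizes are assumptions.

```python
import torch

feat = torch.randn(32, 32, 256)          # (H, W, C) visual feature map
mask = torch.zeros(32, 32, dtype=torch.bool)
mask[8:16, 8:20] = True                  # any free-form shape works

region_token = feat[mask].mean(dim=0)    # pool features inside the region
print(region_token.shape)                # torch.Size([256])
```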

Towards open vocabulary learning: A survey

J Wu, X Li, S Xu, H Yuan, H Ding… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
In the field of visual scene understanding, deep neural networks have made impressive
advancements in various core tasks like segmentation, tracking, and detection. However …

VLT: Vision-language transformer and query generation for referring segmentation

H Ding, C Liu, S Wang, X Jiang - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org
We propose a Vision-Language Transformer (VLT) framework for referring segmentation to
facilitate deep interactions among multi-modal information and enhance the holistic …
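The query-generation idea can be sketched with one cross-attention layer: instead of fixed learned queries, a set of query vectors attends to the word features of the expression, so each query ends up carrying a different emphasis of the sentence. nn.MultiheadAttention here stands in for the paper's query-generation module; all sizes are arbitrary.

```python
import torch
import torch.nn as nn

dim, num_queries, num_words = 256, 16, 12      # assumed sizes
query_seed = torch.randn(num_queries, 1, dim)  # (Q, B, C); learnable in practice
word_feats = torch.randn(num_words, 1, dim)    # (L, B, C) language features

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8)
queries, _ = attn(query_seed, word_feats, word_feats)  # language-aware queries
print(queries.shape)  # torch.Size([16, 1, 256])
```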

Seqtr: A simple yet universal network for visual grounding

C Zhu, Y Zhou, Y Shen, G Luo, X Pan, M Lin… - … on Computer Vision, 2022 - Springer
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding
tasks, e.g., phrase localization, referring expression comprehension (REC), and segmentation …
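What makes such a network "universal" is the shared output interface: a box (two corner points) and a segmentation contour (many points) both become a flat sequence of quantized (x, y) tokens, so one decoder serves REC and segmentation alike. The sketch below illustrates that target construction only; the bin count is again an arbitrary choice.

```python
import torch

NUM_BINS = 1000  # assumed coordinate vocabulary size

def points_to_sequence(points: torch.Tensor) -> torch.Tensor:
    """(N, 2) normalized points -> flat token sequence of length 2N."""
    return (points.clamp(0, 1) * (NUM_BINS - 1)).round().long().flatten()

box_corners = torch.tensor([[0.1, 0.2], [0.6, 0.8]])  # REC target: 4 tokens
contour = torch.rand(18, 2)                           # polygon target: 36 tokens
print(points_to_sequence(box_corners).shape)  # torch.Size([4])
print(points_to_sequence(contour).shape)      # torch.Size([36])
```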