Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Transformers in vision: A survey
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …
Seqtrack: Sequence to sequence learning for visual object tracking
In this paper, we present a new sequence-to-sequence learning framework for visual
tracking, dubbed SeqTrack. It casts visual tracking as a sequence generation problem …
Universal instance perception as object discovery and retrieval
All instance perception tasks aim at finding certain objects specified by some queries such
as category names, language expressions, and target annotations, but this complete field …
Vary: Scaling up the vision vocabulary for large vision-language model
Most Large Vision-Language Models (LVLMs) share the same vision vocabulary, i.e.,
CLIP, for common vision tasks. However, for some special tasks that need dense and fine …
A survey of visual transformers
Transformer, an attention-based encoder–decoder model, has already revolutionized the
field of natural language processing (NLP). Inspired by such significant achievements, some …
Ferret: Refer and ground anything anywhere at any granularity
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of
understanding spatial referring of any shape or granularity within an image and accurately …
Towards open vocabulary learning: A survey
In the field of visual scene understanding, deep neural networks have made impressive
advancements in various core tasks like segmentation, tracking, and detection. However …
VLT: Vision-language transformer and query generation for referring segmentation
We propose a Vision-Language Transformer (VLT) framework for referring segmentation to
facilitate deep interactions among multi-modal information and enhance the holistic …
Seqtr: A simple yet universal network for visual grounding
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding
tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation …