DETRs with hybrid matching
One-to-one set matching is a key design for DETR to establish its end-to-end capability, so
that object detection does not require a hand-crafted NMS (non-maximum suppression) to …
MOTRv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors
In this paper, we propose MOTRv2, a simple yet effective pipeline to bootstrap end-to-end
multi-object tracking with a pretrained object detector. Existing end-to-end methods, e.g. …
A simple single-scale vision transformer for object localization and instance segmentation
This work presents a simple vision transformer design as a strong baseline for object
localization and instance segmentation tasks. Transformers recently demonstrate …
Language as queries for referring video object segmentation
Referring video object segmentation (R-VOS) is an emerging cross-modal task that aims to
segment the target object referred by a language expression in all video frames. In this work …
End-to-end temporal action detection with transformer
Temporal action detection (TAD) aims to determine the semantic label and the temporal
interval of every action instance in an untrimmed video. It is a fundamental and challenging …
MinVIS: A minimal video instance segmentation framework without video-based training
We propose MinVIS, a minimal video instance segmentation (VIS) framework that achieves
state-of-the-art VIS performance with neither video-based architectures nor training …
VITA: Video instance segmentation via object token association
We introduce a novel paradigm for offline Video Instance Segmentation (VIS), based on the
hypothesis that explicit object-oriented information can be a strong clue for understanding …
HTML: Hybrid temporal-scale multimodal learning framework for referring video object segmentation
Referring Video Object Segmentation (RVOS) is to segment the object instance
from a given video, according to the textual description of this object. However, in the open …
Temporal collection and distribution for referring video object segmentation
Referring video object segmentation aims to segment a referent throughout a video
sequence according to a natural language expression. It requires aligning the natural …
VISA: Reasoning video object segmentation via large language models
Existing Video Object Segmentation (VOS) relies on explicit user instructions, such
as categories, masks, or short phrases, restricting their ability to perform complex video …