Segment anything model for medical images?
Abstract: The Segment Anything Model (SAM) is the first foundation model for general image
segmentation. It has achieved impressive results on various natural image segmentation …
Multimodal foundation models: From specialists to general-purpose assistants
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …
Sam-clip: Merging vision foundation models towards semantic and spatial understanding
The landscape of publicly available vision foundation models (VFMs) such as CLIP and
SAM is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their …
Llava-plus: Learning to use tools for creating multimodal agents
Abstract: This paper presents LLaVA-Plus (Large Language and Vision Assistants that Plug and
Learn to Use Skills), a general-purpose multimodal assistant trained using an end …
Tracking anything with decoupled video segmentation
Training data for video segmentation are expensive to annotate. This impedes extensions of
end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary …
OMG-Seg: Is one model good enough for all segmentation?
In this work, we address various segmentation tasks, each traditionally tackled by distinct or
partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently …
Moka: Open-vocabulary robotic manipulation through mark-based visual prompting
Open-vocabulary generalization requires robotic systems to perform tasks involving complex
and diverse environments and task goals. While the recent advances in vision language …
Towards open vocabulary learning: A survey
In the field of visual scene understanding, deep neural networks have made impressive
advancements in various core tasks like segmentation, tracking, and detection. However …
T-rex2: Towards generic object detection via text-visual prompt synergy
We present T-Rex2, a highly practical model for open-set object detection. Previous open-
set object detection methods relying on text prompts effectively encapsulate the abstract …
Open-vocabulary SAM: Segment and recognize twenty-thousand classes interactively
Abstract: The CLIP and Segment Anything Model (SAM) are remarkable vision foundation
models (VFMs). SAM excels in segmentation tasks across diverse domains, whereas CLIP is …