Foundation Models Defining a New Era in Vision: a Survey and Outlook
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …
Segment anything in high quality
The recent Segment Anything Model (SAM) represents a big leap in scaling up
segmentation models, allowing for powerful zero-shot capabilities and flexible prompting …
Foundation models in robotics: Applications, challenges, and the future
We survey applications of pretrained foundation models in robotics. Traditional deep
learning models in robotics are trained on small datasets tailored for specific tasks, which …
Gaussian grouping: Segment and edit anything in 3d scenes
The recent Gaussian Splatting achieves high-quality and real-time novel-view
synthesis of the 3D scenes. However, it is solely concentrated on the appearance and …
Sam-clip: Merging vision foundation models towards semantic and spatial understanding
The landscape of publicly available vision foundation models (VFMs) such as CLIP and
SAM is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their …
Repvit: Revisiting mobile cnn from vit perspective
Recently lightweight Vision Transformers (ViTs) demonstrate superior performance
and lower latency compared with lightweight Convolutional Neural Networks (CNNs) on …
Tracking anything with decoupled video segmentation
Training data for video segmentation are expensive to annotate. This impedes extensions of
end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary …
General in-hand object rotation with vision and touch
We introduce Rotateit, a system that enables fingertip-based object rotation along multiple
axes by leveraging multimodal sensory inputs. Our system is trained in simulation, where it …
Efficientsam: Leveraged masked image pretraining for efficient segment anything
Segment Anything Model (SAM) has emerged as a powerful tool for numerous
vision applications. A key component that drives the impressive performance for zero-shot …
Weakly-supervised semantic segmentation with image-level labels: from traditional models to foundation models
The rapid development of deep learning has driven significant progress in image semantic
segmentation—a fundamental task in computer vision. Semantic segmentation algorithms …