Foundation Models Defining a New Era in Vision: A Survey and Outlook
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …
Review of large vision models and visual prompt engineering
Visual prompt engineering is a fundamental methodology in the field of visual and image
artificial general intelligence. As the development of large vision models progresses, the …
SAM 2: Segment anything in images and videos
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving
promptable visual segmentation in images and videos. We build a data engine, which …
Tracking anything with decoupled video segmentation
Training data for video segmentation are expensive to annotate. This impedes extensions of
end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary …
CoDeF: Content deformation fields for temporally consistent video processing
We present the content deformation field (CoDeF) as a new type of video representation
which consists of a canonical content field aggregating the static contents in the entire video …
Segment anything is not always perfect: An investigation of sam on different real-world applications
Recently, Meta AI Research released a general, promptable segment anything
model (SAM) pre-trained on an unprecedentedly large segmentation dataset (SA-1B) …
OMG-LLaVA: Bridging image-level, object-level, pixel-level reasoning and understanding
Current universal segmentation methods demonstrate strong capabilities in pixel-level
image and video understanding. However, they lack reasoning abilities and cannot be …
EfficientSAM: Leveraged masked image pretraining for efficient segment anything
Segment Anything Model (SAM) has emerged as a powerful tool for numerous
vision applications. A key component that drives the impressive performance for zero-shot …
EvalCrafter: Benchmarking and evaluating large video generation models
Vision and language generative models have grown rapidly in recent years. For
video generation, various open-source models and publicly available services have been …
LangSplat: 3D language Gaussian splatting
Humans live in a 3D world and commonly use natural language to interact with a 3D scene.
Modeling a 3D language field to support open-ended language queries in 3D has gained …