Foundation Models Defining a New Era in Vision: a Survey and Outlook
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …
fundamental to understanding our world. The complex relations between objects and their …
Advances in medical image analysis with vision transformers: a comprehensive review
The remarkable performance of the Transformer architecture in natural language processing
has recently also triggered broad interest in Computer Vision. Among other merits …
has recently also triggered broad interest in Computer Vision. Among other merits …
Yolov9: Learning what you want to learn using programmable gradient information
Today's deep learning methods focus on how to design the objective functions to make the
prediction as close as possible to the target. Meanwhile, an appropriate neural network …
prediction as close as possible to the target. Meanwhile, an appropriate neural network …
Segment everything everywhere all at once
In this work, we present SEEM, a promotable and interactive model for segmenting
everything everywhere all at once in an image. In SEEM, we propose a novel and versatile …
everything everywhere all at once in an image. In SEEM, we propose a novel and versatile …
Vision mamba: Efficient visual representation learning with bidirectional state space model
Recently the state space models (SSMs) with efficient hardware-aware designs, ie, the
Mamba deep learning model, have shown great potential for long sequence modeling …
Mamba deep learning model, have shown great potential for long sequence modeling …
Internimage: Exploring large-scale vision foundation models with deformable convolutions
Compared to the great progress of large-scale vision transformers (ViTs) in recent years,
large-scale models based on convolutional neural networks (CNNs) are still in an early …
large-scale models based on convolutional neural networks (CNNs) are still in an early …
Dual aggregation transformer for image super-resolution
Transformer has recently gained considerable popularity in low-level vision tasks, including
image super-resolution (SR). These networks utilize self-attention along different …
image super-resolution (SR). These networks utilize self-attention along different …
Generalized decoding for pixel, image, and language
We present X-Decoder, a generalized decoding model that can predict pixel-level
segmentation and language tokens seamlessly. X-Decoder takes as input two types of …
segmentation and language tokens seamlessly. X-Decoder takes as input two types of …
Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …
while other modalities such as audio and subtitles in videos have not received sufficient …
S4nd: Modeling images and videos as multidimensional signals with state spaces
Visual data such as images and videos are typically modeled as discretizations of inherently
continuous, multidimensional signals. Existing continuous-signal models attempt to exploit …
continuous, multidimensional signals. Existing continuous-signal models attempt to exploit …