Sora: A review on background, technology, limitations, and opportunities of large vision models
Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The
model is trained to generate videos of realistic or imaginative scenes from text instructions …
Automated diagnosis of cardiovascular diseases from cardiac magnetic resonance imaging using deep learning models: A review
In recent years, cardiovascular diseases (CVDs) have become one of the leading causes of
mortality globally. At early stages, CVDs appear with minor symptoms and progressively get …
Focal modulation networks
We propose focal modulation networks (FocalNets in short), where self-attention (SA) is
completely replaced by a focal modulation module for modeling token interactions in vision …
Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution
before processing them with computer vision models has not yet been successfully …
FlexiViT: One model for all patch sizes
Vision Transformers convert images to sequences by slicing them into patches. The size of
these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher …
ProPainter: Improving propagation and transformer for video inpainting
Flow-based propagation and spatiotemporal Transformer are two mainstream mechanisms
in video inpainting (VI). Despite the effectiveness of these components, they still suffer from …
Global context vision transformers
We propose global context vision transformer (GC ViT), a novel architecture that enhances
parameter and compute utilization for computer vision. Our method leverages global context …
Scale-aware modulation meet transformer
This paper presents a new vision Transformer, Scale Aware Modulation Transformer (SMT),
that can handle various downstream tasks efficiently by combining the convolutional network …
EfficientAD: Accurate visual anomaly detection at millisecond-level latencies
Detecting anomalies in images is an important task, especially in real-time computer vision
applications. In this work, we focus on computational efficiency and propose a lightweight …
Token merging for fast stable diffusion
The landscape of image generation has been forever changed by open vocabulary diffusion
models. However, at their core these models use transformers, which makes generation …