A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS
YOLO has become a central real-time object detection system for robotics, driverless cars,
and video monitoring applications. We present a comprehensive analysis of YOLO's …
Foundation Models Defining a New Era in Vision: a Survey and Outlook
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …
Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection
In this paper, we develop an open-set object detector, called Grounding DINO, by marrying
Transformer-based detector DINO with grounded pre-training, which can detect arbitrary …
CogVLM: Visual expert for pretrained language models
We introduce CogVLM, a powerful open-source visual language foundation model. Different
from the popular shallow alignment method which maps image features into the input space …
GLIPv2: Unifying localization and vision-language understanding
We present GLIPv2, a grounded VL understanding model, that serves both localization tasks
(e.g., object detection, instance segmentation) and Vision-Language (VL) understanding …
Grounded language-image pre-training
This paper presents a grounded language-image pre-training (GLIP) model for learning
object-level, language-aware, and semantic-rich visual representations. GLIP unifies object …
Vector quantized diffusion model for text-to-image synthesis
We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation.
This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent …
ViM: Out-of-distribution with virtual-logit matching
Most of the existing Out-Of-Distribution (OOD) detection algorithms depend on single input
source: the feature, the logit, or the softmax probability. However, the immense diversity of …
Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts
The availability of large-scale image captioning and visual question answering datasets has
contributed significantly to recent successes in vision-and-language pre-training. However …
DetCLIP: Dictionary-enriched visual-concept paralleled pre-training for open-world detection
Open-world object detection, as a more general and challenging goal, aims to recognize
and localize objects described by arbitrary category names. The recent work GLIP …