A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS

J Terven, DM Córdova-Esparza… - Machine learning and …, 2023 - mdpi.com
YOLO has become a central real-time object detection system for robotics, driverless cars,
and video monitoring applications. We present a comprehensive analysis of YOLO's …

Foundation Models Defining a New Era in Vision: a Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

CogVLM: Visual expert for pretrained language models

W Wang, Q Lv, W Yu, W Hong, J Qi… - Advances in …, 2025 - proceedings.neurips.cc
We introduce CogVLM, a powerful open-source visual language foundation model. Different
from the popular shallow alignment method which maps image features into the …

Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection

S Liu, Z Zeng, T Ren, F Li, H Zhang, J Yang… - … on Computer Vision, 2024 - Springer
In this paper, we develop an open-set object detector, called Grounding DINO, by marrying
Transformer-based detector DINO with grounded pre-training, which can detect arbitrary …

GLIPv2: Unifying localization and vision-language understanding

H Zhang, P Zhang, X Hu, YC Chen… - Advances in …, 2022 - proceedings.neurips.cc
We present GLIPv2, a grounded VL understanding model that serves both localization tasks
(e.g., object detection, instance segmentation) and Vision-Language (VL) understanding …

LLaVAR: Enhanced visual instruction tuning for text-rich image understanding

Y Zhang, R Zhang, J Gu, Y Zhou, N Lipka… - arXiv preprint arXiv …, 2023 - arxiv.org
Instruction tuning unlocks the superior capability of Large Language Models (LLMs) to
interact with humans. Furthermore, recent instruction-following datasets include images as …

Grounded language-image pre-training

LH Li, P Zhang, H Zhang, J Yang, C Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper presents a grounded language-image pre-training (GLIP) model for learning
object-level, language-aware, and semantic-rich visual representations. GLIP unifies object …

ViM: Out-of-distribution with virtual-logit matching

H Wang, Z Li, L Feng, W Zhang - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Most of the existing Out-Of-Distribution (OOD) detection algorithms depend on a single input
source: the feature, the logit, or the softmax probability. However, the immense diversity of …

Vector quantized diffusion model for text-to-image synthesis

S Gu, D Chen, J Bao, F Wen, B Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com
We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation.
This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent …

Scene text recognition with permuted autoregressive sequence models

D Bautista, R Atienza - European Conference on Computer Vision, 2022 - Springer
Context-aware STR methods typically use internal autoregressive (AR) language models
(LM). Inherent limitations of AR models motivated two-stage methods which employ an …