A comprehensive review of object detection with deep learning
R Kaur, S Singh - Digital Signal Processing, 2023 - Elsevier
In the realm of computer vision, Deep Convolutional Neural Networks (DCNNs) have
demonstrated excellent performance. Video Processing, Object Detection, Image …
demonstrated excellent performance. Video Processing, Object Detection, Image …
Understanding deep learning techniques for image segmentation
The machine learning community has been overwhelmed by a plethora of deep learning--
based approaches. Many challenging computer vision tasks, such as detection, localization …
based approaches. Many challenging computer vision tasks, such as detection, localization …
How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …
(MLLM) to bridge the capability gap between open-source and proprietary commercial …
Deepseek-vl: towards real-world vision-language understanding
We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-
world vision and language understanding applications. Our approach is structured around …
world vision and language understanding applications. Our approach is structured around …
Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models
Large Vision-Language Models (LVLMs) have recently played a dominant role in
multimodal vision-language learning. Despite the great success, it lacks a holistic evaluation …
multimodal vision-language learning. Despite the great success, it lacks a holistic evaluation …
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action
We present Unified-IO 2 a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text images audio and/or videos as input and can …
novel instructions. Unified-IO 2 can use text images audio and/or videos as input and can …
Vary: Scaling up the vision vocabulary for large vision-language model
Abstract Most Large Vision-Language Models (LVLMs) enjoy the same vision vocabulary, ie,
CLIP, for common vision tasks. However, for some special task that needs dense and fine …
CLIP, for common vision tasks. However, for some special task that needs dense and fine …
Scene text recognition with permuted autoregressive sequence models
Context-aware STR methods typically use internal autoregressive (AR) language models
(LM). Inherent limitations of AR models motivated two-stage methods which employ an …
(LM). Inherent limitations of AR models motivated two-stage methods which employ an …
On the hidden mystery of ocr in large multimodal models
Large models have recently played a dominant role in natural language processing and
multimodal vision-language learning. However, their effectiveness in text-related visual …
multimodal vision-language learning. However, their effectiveness in text-related visual …
Towards vqa models that can read
Studies have shown that a dominant class of questions asked by visually impaired users on
images of their surroundings involves reading text in the image. But today's VQA models can …
images of their surroundings involves reading text in the image. But today's VQA models can …