A comprehensive review of object detection with deep learning

R Kaur, S Singh - Digital Signal Processing, 2023 - Elsevier
In the realm of computer vision, Deep Convolutional Neural Networks (DCNNs) have
demonstrated excellent performance. Video Processing, Object Detection, Image …

Understanding deep learning techniques for image segmentation

S Ghosh, N Das, I Das, U Maulik - ACM computing surveys (CSUR), 2019 - dl.acm.org
The machine learning community has been overwhelmed by a plethora of deep learning--
based approaches. Many challenging computer vision tasks, such as detection, localization …

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Deepseek-vl: towards real-world vision-language understanding

H Lu, W Liu, B Zhang, B Wang, K Dong, B Liu… - arxiv preprint arxiv …, 2024 - arxiv.org
We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-
world vision and language understanding applications. Our approach is structured around …

Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

P Xu, W Shao, K Zhang, P Gao, S Liu… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Large Vision-Language Models (LVLMs) have recently played a dominant role in
multimodal vision-language learning. Despite the great success, it lacks a holistic evaluation …

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2 a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text images audio and/or videos as input and can …

Vary: Scaling up the vision vocabulary for large vision-language model

H Wei, L Kong, J Chen, L Zhao, Z Ge, J Yang… - … on Computer Vision, 2024 - Springer
Abstract Most Large Vision-Language Models (LVLMs) enjoy the same vision vocabulary, ie,
CLIP, for common vision tasks. However, for some special task that needs dense and fine …

Scene text recognition with permuted autoregressive sequence models

D Bautista, R Atienza - European conference on computer vision, 2022 - Springer
Context-aware STR methods typically use internal autoregressive (AR) language models
(LM). Inherent limitations of AR models motivated two-stage methods which employ an …

On the hidden mystery of ocr in large multimodal models

Y Liu, Z Li, B Yang, C Li, X Yin, C Liu, L **… - arxiv preprint arxiv …, 2023 - arxiv.org
Large models have recently played a dominant role in natural language processing and
multimodal vision-language learning. However, their effectiveness in text-related visual …

Towards vqa models that can read

A Singh, V Natarajan, M Shah… - Proceedings of the …, 2019 - openaccess.thecvf.com
Studies have shown that a dominant class of questions asked by visually impaired users on
images of their surroundings involves reading text in the image. But today's VQA models can …