Text recognition in the wild: A survey

X Chen, L Jin, Y Zhu, C Luo, T Wang - ACM Computing Surveys (CSUR), 2021 - dl.acm.org
The history of text can be traced back over thousands of years. Rich and precise semantic
information carried by text is important in a wide range of vision-based application …

Medical visual question answering: A survey

Z Lin, D Zhang, Q Tao, D Shi, G Haffari, Q Wu… - Artificial Intelligence in …, 2023 - Elsevier
Abstract Medical Visual Question Answering (VQA) is a combination of medical artificial
intelligence and popular VQA challenges. Given a medical image and a clinically relevant …

Monkey: Image resolution and text label are important things for large multi-modal models

Z Li, B Yang, Q Liu, Z Ma, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Z Chen, J Wu, W Wang, W Su, G Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However the progress in vision and vision …

Docvqa: A dataset for vqa on document images

M Mathew, D Karatzas… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We present a new dataset for Visual Question Answering (VQA) on document images called
DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images …

On the hidden mystery of OCR in large multimodal models

Y Liu, Z Li, B Yang, C Li, X Yin, C Liu, L Jin… - arxiv preprint arxiv …, 2023 - arxiv.org
Large models have recently played a dominant role in natural language processing and
multimodal vision-language learning. However, their effectiveness in text-related visual …

Latr: Layout-aware transformer for scene-text vqa

AF Biten, R Litman, Y Xie… - Proceedings of the …, 2022 - openaccess.thecvf.com
We propose a novel multimodal architecture for Scene Text Visual Question Answering
(STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to …

Tap: Text-aware pre-training for text-vqa and text-caption

Z Yang, Y Lu, J Wang, X Yin… - Proceedings of the …, 2021 - openaccess.thecvf.com
In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption
tasks. These two tasks aim at reading and understanding scene text in images for question …

Textocr: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text

A Singh, G Pang, M Toh, J Huang… - Proceedings of the …, 2021 - openaccess.thecvf.com
A crucial component of the scene-text-based reasoning required for the TextVQA and TextCaps
datasets involves detecting and recognizing text present in the images using an optical …

Progressive contour regression for arbitrary-shape scene text detection

P Dai, S Zhang, H Zhang, X Cao - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
State-of-the-art scene text detection methods usually model the text instance with local
pixels or components from the bottom-up perspective and, therefore, are sensitive to noises …