Text recognition in the wild: A survey
The history of text can be traced back over thousands of years. Rich and precise semantic
information carried by text is important in a wide range of vision-based application …
Medical visual question answering: A survey
Medical Visual Question Answering (VQA) is a combination of medical artificial
intelligence and popular VQA challenges. Given a medical image and a clinically relevant …
Monkey: Image resolution and text label are important things for large multi-modal models
Z Li, B Yang, Q Liu, Z Ma, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However, the progress in vision and vision …
Docvqa: A dataset for vqa on document images
We present a new dataset for Visual Question Answering (VQA) on document images called
DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images …
On the hidden mystery of ocr in large multimodal models
Large models have recently played a dominant role in natural language processing and
multimodal vision-language learning. However, their effectiveness in text-related visual …
Latr: Layout-aware transformer for scene-text vqa
We propose a novel multimodal architecture for Scene Text Visual Question Answering
(STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to …
Tap: Text-aware pre-training for text-vqa and text-caption
In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption
tasks. These two tasks aim at reading and understanding scene text in images for question …
Textocr: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text
A crucial component of the scene-text-based reasoning required for the TextVQA and TextCaps
datasets involves detecting and recognizing text present in the images using an optical …
Progressive contour regression for arbitrary-shape scene text detection
State-of-the-art scene text detection methods usually model the text instance with local
pixels or components from the bottom-up perspective and, therefore, are sensitive to noise …