Text recognition in the wild: A survey

X Chen, L Jin, Y Zhu, C Luo, T Wang - ACM Computing Surveys (CSUR), 2021 - dl.acm.org
The history of text can be traced back over thousands of years. Rich and precise semantic
information carried by text is important in a wide range of vision-based application …

Medical visual question answering: A survey

Z Lin, D Zhang, Q Tao, D Shi, G Haffari, Q Wu… - Artificial Intelligence in …, 2023 - Elsevier
Abstract Medical Visual Question Answering (VQA) is a combination of medical artificial
intelligence and popular VQA challenges. Given a medical image and a clinically relevant …

Monkey: Image resolution and text label are important things for large multi-modal models

Z Li, B Yang, Q Liu, Z Ma, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Z Chen, J Wu, W Wang, W Su, G Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However the progress in vision and vision …

Docvqa: A dataset for vqa on document images

M Mathew, D Karatzas… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We present a new dataset for Visual Question Answering (VQA) on document images called
DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images …

On the hidden mystery of OCR in large multimodal models

Y Liu, Z Li, B Yang, C Li, X Yin, C Liu, L Jin… - arxiv preprint arxiv …, 2023 - arxiv.org
Large models have recently played a dominant role in natural language processing and
multimodal vision-language learning. However, their effectiveness in text-related visual …

Latr: Layout-aware transformer for scene-text vqa

AF Biten, R Litman, Y Xie… - Proceedings of the …, 2022 - openaccess.thecvf.com
We propose a novel multimodal architecture for Scene Text Visual Question Answering
(STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to …

Tap: Text-aware pre-training for text-vqa and text-caption

Z Yang, Y Lu, J Wang, X Yin… - Proceedings of the …, 2021 - openaccess.thecvf.com
In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption
tasks. These two tasks aim at reading and understanding scene text in images for question …

Textocr: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text

A Singh, G Pang, M Toh, J Huang… - Proceedings of the …, 2021 - openaccess.thecvf.com
A crucial component of the scene-text-based reasoning required for the TextVQA and TextCaps
datasets involves detecting and recognizing text present in the images using an optical …

Progressive contour regression for arbitrary-shape scene text detection

P Dai, S Zhang, H Zhang, X Cao - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
State-of-the-art scene text detection methods usually model the text instance with local
pixels or components from the bottom-up perspective and, therefore, are sensitive to noises …