LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models

P Xu, W Shao, K Zhang, P Gao, S Liu… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Large Vision-Language Models (LVLMs) have recently played a dominant role in
multimodal vision-language learning. Despite this great success, the field still lacks a holistic evaluation …

LLaVAR: Enhanced visual instruction tuning for text-rich image understanding

Y Zhang, R Zhang, J Gu, Y Zhou, N Lipka… - arXiv preprint arXiv …, 2023 - arxiv.org
Instruction tuning unlocks the superior capability of Large Language Models (LLMs) to
interact with humans. Furthermore, recent instruction-following datasets include images as …

DocVQA: A dataset for VQA on document images

M Mathew, D Karatzas… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We present a new dataset for Visual Question Answering (VQA) on document images called
DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images …

Document understanding dataset and evaluation (DUDE)

J Van Landeghem, R Tito… - Proceedings of the …, 2023 - openaccess.thecvf.com
We call on the Document AI (DocAI) community to re-evaluate current methodologies and
embrace the challenge of creating more practically-oriented benchmarks. Document …

DocPedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding

H Feng, Q Liu, H Liu, J Tang, W Zhou, H Li… - Science China …, 2024 - Springer
In this work, we present DocPedia, a novel large multimodal model (LMM) for versatile OCR-
free document understanding, capable of parsing images up to 2560×2560 resolution …

Scene text visual question answering

AF Biten, R Tito, A Mafla, L Gomez… - Proceedings of the …, 2019 - openaccess.thecvf.com
Current visual question answering datasets do not consider the rich semantic information
conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims …

ERNIE-Layout: Layout knowledge enhanced pre-training for visually-rich document understanding

Q Peng, Y Pan, W Wang, B Luo, Z Zhang… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent years have witnessed the rise and success of pre-training techniques in visually-rich
document understanding. However, most existing methods lack the systematic mining and …

Going full-TILT boogie on document understanding with text-image-layout transformer

R Powalski, Ł Borchmann, D Jurkiewicz… - Document Analysis and …, 2021 - Springer
We address the challenging problem of Natural Language Comprehension beyond plain-
text documents by introducing the TILT neural network architecture which simultaneously …

Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA

R Hu, A Singh, T Darrell… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
Many visual scenes contain text that carries crucial information, and it is thus essential to
understand text in images for downstream reasoning tasks. For example, a deep water label …

DocFormerv2: Local features for document understanding

S Appalaraju, P Tang, Q Dong, N Sankaran… - Proceedings of the …, 2024 - ojs.aaai.org
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding
(VDU). The VDU domain entails understanding documents (beyond mere OCR predictions) …