LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models
Large Vision-Language Models (LVLMs) have recently played a dominant role in
multimodal vision-language learning. Despite the great success, it lacks a holistic evaluation …
LLaVAR: Enhanced visual instruction tuning for text-rich image understanding
Instruction tuning unlocks the superior capability of Large Language Models (LLM) to
interact with humans. Furthermore, recent instruction-following datasets include images as …
DocVQA: A dataset for VQA on document images
We present a new dataset for Visual Question Answering (VQA) on document images called
DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images …
Document understanding dataset and evaluation (DUDE)
We call on the Document AI (DocAI) community to re-evaluate current methodologies and
embrace the challenge of creating more practically-oriented benchmarks. Document …
DocPedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding
In this work, we present DocPedia, a novel large multimodal model (LMM) for versatile OCR-
free document understanding, capable of parsing images up to 2560×2560 resolution …
Scene text visual question answering
Current visual question answering datasets do not consider the rich semantic information
conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims …
ERNIE-Layout: Layout knowledge enhanced pre-training for visually-rich document understanding
Recent years have witnessed the rise and success of pre-training techniques in visually-rich
document understanding. However, most existing methods lack the systematic mining and …
Going full-TILT boogie on document understanding with text-image-layout transformer
We address the challenging problem of Natural Language Comprehension beyond plain-
text documents by introducing the TILT neural network architecture which simultaneously …
Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA
Many visual scenes contain text that carries crucial information, and it is thus essential to
understand text in images for downstream reasoning tasks. For example, a deep water label …
DocFormerv2: Local features for document understanding
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding
(VDU). The VDU domain entails understanding documents (beyond mere OCR predictions) …