LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models
Large Vision-Language Models (LVLMs) have recently played a dominant role in
multimodal vision-language learning. Despite the great success, it lacks a holistic evaluation …
LLaVAR: Enhanced visual instruction tuning for text-rich image understanding
Instruction tuning unlocks the superior capability of Large Language Models (LLM) to
interact with humans. Furthermore, recent instruction-following datasets include images as …
DocVQA: A dataset for VQA on document images
We present a new dataset for Visual Question Answering (VQA) on document images called
DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images …
Document understanding dataset and evaluation (DUDE)
We call on the Document AI (DocAI) community to re-evaluate current methodologies and
embrace the challenge of creating more practically-oriented benchmarks. Document …
DocPedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding
In this work, we present DocPedia, a novel large multimodal model (LMM) for versatile OCR-
free document understanding, capable of parsing images up to 2560×2560 resolution …
Scene text visual question answering
Current visual question answering datasets do not consider the rich semantic information
conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims …
ERNIE-Layout: Layout knowledge enhanced pre-training for visually-rich document understanding
Recent years have witnessed the rise and success of pre-training techniques in visually-rich
document understanding. However, most existing methods lack the systematic mining and …
Going full-TILT boogie on document understanding with text-image-layout transformer
We address the challenging problem of Natural Language Comprehension beyond plain-
text documents by introducing the TILT neural network architecture which simultaneously …
Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA
Many visual scenes contain text that carries crucial information, and it is thus essential to
understand text in images for downstream reasoning tasks. For example, a deep water label …
DocFormerv2: Local features for document understanding
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding
(VDU). The VDU domain entails understanding documents (beyond mere OCR predictions) …