Deep Learning based Visually Rich Document Content Understanding: A Survey

Y Ding, J Lee, SC Han - arXiv preprint arXiv:2408.01287, 2024 - arxiv.org
Visually Rich Documents (VRDs) are essential in academia, finance, medical fields, and
marketing due to their multimodal information content. Traditional methods for extracting …

BlueLM-V-3B: Algorithm and system co-design for multimodal large language models on mobile devices

X Lu, Y Chen, C Chen, H Tan, B Chen, Y Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
The emergence and growing popularity of multimodal large language models (MLLMs) have
significant potential to enhance various aspects of daily life, from improving communication …

Privacy-aware document visual question answering

R Tito, K Nguyen, M Tobaben, R Kerkouche… - … on Document Analysis …, 2024 - Springer
Document Visual Question Answering (DocVQA) has quickly grown into a central
task of document understanding. But despite the fact that documents contain sensitive or …

Overview of DocILE 2023: Document Information Localization and Extraction

Š Šimsa, M Uřičář, M Šulc, Y Patel, A Hamdi… - … Conference of the Cross …, 2023 - Springer
This paper provides an overview of the DocILE 2023 Competition, its tasks, participant
submissions, the competition results and possible future research directions. This first …

Towards a new research agenda for multimodal enterprise document understanding: What are we missing?

A Nourbakhsh, S Shah, C Rose - Findings of the Association for …, 2024 - aclanthology.org
The field of multimodal document understanding has produced a suite of models that have
achieved stellar performance across several tasks, even coming close to human …

Towards reducing hallucination in extracting information from financial reports using Large Language Models

B Sarmah, D Mehta, S Pasquali, T Zhu - Proceedings of the Third …, 2023 - dl.acm.org
For a financial analyst, the question and answer (Q&A) segment of the company financial
report is a crucial piece of information for various analysis and investment decisions …

Beyond Document Page Classification: Design, Datasets, and Challenges

J Van Landeghem, S Biswas… - Proceedings of the …, 2024 - openaccess.thecvf.com
This paper highlights the need to bring document classification benchmarking closer to real-
world applications, both in the nature of data tested (X: multi-channel, multi-paged, multi …

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

F Zhu, Z Liu, XY Ng, H Wu, W Wang, F Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have achieved remarkable performance in many
vision-language tasks, yet their capabilities in fine-grained visual understanding remain …

WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

X Xie, H Yan, L Yin, Y Liu, J Ding, M Liao, Y Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal document understanding is a challenging task to process and comprehend large
amounts of textual and visual information. Recent advances in Large Language Models …

Deep learning approaches for information extraction from visually rich documents: datasets, challenges and methods

H Gbada, K Kalti, MA Mahjoub - International Journal on Document …, 2024 - Springer
This paper focuses on Information Extraction from Visually Rich Documents, exploring how
deep learning methods are applied in this field. For the purpose of comparing the …