Monkey: Image resolution and text label are important things for large multi-modal models

Z Li, B Yang, Q Liu, Z Ma, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …

MiniCPM-V: A GPT-4V level MLLM on your phone

Y Yao, T Yu, A Zhang, C Wang, J Cui, H Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally
reshaped the landscape of AI research and industry, shedding light on a promising path …

UReader: Universal OCR-free visually-situated language understanding with multimodal large language model

J Ye, A Hu, H Xu, Q Ye, M Yan, G Xu, C Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Text is ubiquitous in our visual world, conveying crucial information, such as in documents,
websites, and everyday photographs. In this work, we propose UReader, a first exploration …

SPHINX-X: Scaling data and parameters for a family of multi-modal large language models

D Liu, R Zhang, L Qiu, S Huang, W Lin, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series
developed upon SPHINX. To improve the architecture and training efficiency, we modify the …

mPLUG-DocOwl 1.5: Unified structure learning for OCR-free document understanding

A Hu, H Xu, J Ye, M Yan, L Zhang, B Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Structure information is critical for understanding the semantics of text-rich images, such as
documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for …

mPLUG-DocOwl: Modularized multimodal large language model for document understanding

J Ye, A Hu, H Xu, Q Ye, M Yan, Y Dan, C Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
Document understanding refers to automatically extracting, analyzing, and comprehending
information from various types of digital documents, such as a web page. Existing Multi …

DocFormer: End-to-end transformer for document understanding

S Appalaraju, B Jasani, BU Kota… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present DocFormer, a multi-modal transformer-based architecture for the task of Visual
Document Understanding (VDU). VDU is a challenging problem which aims to understand …

Unifying vision, text, and layout for universal document processing

Z Tang, Z Yang, G Wang, Y Fang… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose Universal Document Processing (UDOP), a foundation Document AI
model which unifies text, image, and layout modalities together with varied task formats …

LayoutLMv2: Multi-modal pre-training for visually-rich document understanding

Y Xu, Y Xu, T Lv, L Cui, F Wei, G Wang, Y Lu… - arXiv preprint arXiv …, 2020 - arxiv.org
Pre-training of text and layout has proved effective in a variety of visually-rich document
understanding tasks due to its effective model architecture and the advantage of large-scale …

MMLongBench-Doc: Benchmarking long-context document understanding with visualizations

Y Ma, Y Zang, L Chen, M Chen, Y Jiao… - Advances in …, 2025 - proceedings.neurips.cc
Understanding documents with rich layouts and multi-modal components is a long-standing
and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable …