Document parsing unveiled: Techniques, challenges, and prospects for structured information extraction

Q Zhang, VSJ Huang, B Wang, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Document parsing is essential for converting unstructured and semi-structured documents,
such as contracts, academic papers, and invoices, into structured, machine-readable data …

OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

J Zhang, Q Zhang, B Wang, L Ouyang, Z Wen… - arXiv preprint arXiv …, 2024 - arxiv.org
Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by
integrating external knowledge to reduce hallucinations and incorporate up-to-date …

VISA: Retrieval Augmented Generation with Visual Source Attribution

X Ma, S Zhuang, B Koopman, G Zuccon… - arXiv preprint arXiv …, 2024 - arxiv.org
Generation with source attribution is important for enhancing the verifiability of retrieval-
augmented generation (RAG) systems. However, existing approaches in RAG primarily link …

UniCoRN: Unified Commented Retrieval Network with LMMs

M Jaritz, M Guillaumin, S Sternig, L Bazzani - arXiv preprint arXiv …, 2025 - arxiv.org
Multimodal retrieval methods have limitations in handling complex, compositional queries
that require reasoning about the visual content of both the query and the retrieved entities …

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

J Zhou, Z Liu, Z Liu, S Xiao, Y Wang, B Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapidly growing demand for multimodal retrieval, progress in this field remains
severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a …

Document Screenshot Retrievers are Vulnerable to Pixel Poisoning Attacks

S Zhuang, E Khramtsova, X Ma, B Koopman… - arXiv preprint arXiv …, 2025 - arxiv.org
Recent advancements in dense retrieval have introduced vision-language model (VLM)-
based retrievers, such as DSE and ColPali, which leverage document screenshots …

LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating

C Deng, J Yuan, P Bu, P Wang, ZZ Li, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large vision language models (LVLMs) have improved the document understanding
capabilities remarkably, enabling the handling of complex document elements, longer …

VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos

X Ren, L Xu, L Xia, S Wang, D Yin, C Huang - arXiv preprint arXiv …, 2025 - arxiv.org
Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in
enhancing Large Language Models (LLMs) through external knowledge integration, yet its …

How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey

Y Qi, H Li, Y Song, X Wu, J Luo - arXiv preprint arXiv:2412.08158, 2024 - arxiv.org
The exploration of various vision-language tasks, such as visual captioning, visual question
answering, and visual commonsense reasoning, is an important area in artificial intelligence …

An Archaeological Catalog Collection Method Based on Large Vision-Language Models

H Pang, Y Chang, T Duan, X Yang - arXiv preprint arXiv:2412.20088, 2024 - arxiv.org
Archaeological catalogs, containing key elements such as artifact images, morphological
descriptions, and excavation information, are essential for studying artifact evolution and …