DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming

J Zhang, W Yang, S Lai, Z Xie, L Jin - arXiv preprint arXiv:2406.19101, 2024 - arxiv.org
Current multimodal large language models (MLLMs) face significant challenges in visual
document understanding (VDU) tasks due to the high resolution, dense text, and complex …

ST³: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming

J Zhuang, L Lu, M Dai, R Hu, J Chen, Q Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance their perceptual capabilities by
integrating visual and textual information. However, processing the massive number of …