ColPali: Efficient document retrieval with vision language models

M Faysse, H Sibille, T Wu, B Omrani… - The Thirteenth …, 2024 - openreview.net
Documents are visually rich structures that convey information through text, but also figures,
page layouts, tables, or even fonts. Since modern retrieval systems mainly rely on the textual …

MINT-1T: Scaling open-source multimodal data by 10x: A multimodal dataset with one trillion tokens

A Awadalla, L Xue, O Lo, M Shu, H Lee… - The Thirty-eighth …, 2024 - openreview.net
Multimodal interleaved datasets featuring free-form interleaved sequences of images and
text are crucial for training frontier large multimodal models (LMMs). Despite the rapid …

POINTS: Improving your vision-language model with affordable strategies

Y Liu, Z Zhao, Z Zhuang, L Tian, X Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, vision-language models have made significant strides, excelling in tasks like
optical character recognition and geometric problem-solving. However, several critical …

Task Vectors are Cross-Modal

G Luo, T Darrell, A Bar - arXiv preprint arXiv:2410.22330, 2024 - arxiv.org
We investigate the internal representations of vision-and-language models (VLMs) and how
they encode task representations. We consider tasks specified through examples or …

Comics Datasets Framework: Mix of Comics datasets for detection benchmarking

E Vivoli, I Campaioli, M Nardoni, N Biondi… - … on Document Analysis …, 2024 - Springer
Comics, as a medium, uniquely combine text and images in styles often distinct from real-
world visuals. For the past three decades, computational research on comics has evolved …

ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting

A Mishra, R Noh, H Fu, M Li, M Kim - arXiv preprint arXiv:2502.14780, 2025 - arxiv.org
Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern
smartphones with powerful cameras become primary interfaces for human-computer …

Retrospective Learning from Interactions

Z Chen, MO Gul, Y Chen, G Geng, A Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-turn interactions between large language models (LLMs) and users naturally include
implicit feedback signals. If an LLM responds in an unexpected way to an instruction, the …