BlueLM-V-3B: Algorithm and system co-design for multimodal large language models on mobile devices

X Lu, Y Chen, C Chen, H Tan, B Chen, Y Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
The emergence and growing popularity of multimodal large language models (MLLMs) have
significant potential to enhance various aspects of daily life, from improving communication …

Skip \n: A simple method to reduce hallucination in large vision-language models

Z Han, Z Bai, H Mei, Q Xu, C Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in large vision-language models (LVLMs) have demonstrated
impressive capability in visual information understanding with human language. Despite …

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

AJ Wang, L Li, Y Lin, M Li, L Wang… - Advances in Neural …, 2025 - proceedings.neurips.cc
Training models with longer in-context lengths is a significant challenge for multimodal
machine learning due to substantial GPU memory and computational costs. This exploratory …

Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval

G Zeng, Y Zhang, J Wei, D Yang, P Zhang… - Proceedings of the …, 2024 - dl.acm.org
Scene text retrieval aims to find all images containing the query text from an image gallery.
Current efforts tend to adopt an Optical Character Recognition (OCR) pipeline, which …

DSDL: Data Set Description Language for Bridging Modalities and Tasks in AI Data

B Wang, L Ouyang, F Wu, W Ning, X Han… - arXiv preprint arXiv …, 2024 - arxiv.org
In the era of artificial intelligence, the diversity of data modalities and annotation formats
often renders data unusable directly, requiring understanding and format conversion before …

Type-R: Automatically Retouching Typos for Text-to-Image Generation

W Shimoda, N Inoue, D Haraguchi, H Mitani… - arXiv preprint arXiv …, 2024 - arxiv.org
While recent text-to-image models can generate photorealistic images from text prompts that
reflect detailed instructions, they still face significant challenges in accurately rendering …

Improving text generation on images with synthetic captions

J Koh, S Park, J Song - 2024 16th IIAI International Congress …, 2024 - ieeexplore.ieee.org
The recent emergence of latent diffusion models such as SDXL [1] and SD 1.5 [2] has shown
significant capability in generating highly detailed and realistic images. Despite their …

Typographic Attacks in a Multi-Image Setting

X Wang, Z Zhao, M Larson - arXiv preprint arXiv:2502.08193, 2025 - arxiv.org
Large Vision-Language Models (LVLMs) are susceptible to typographic attacks, which are
misclassifications caused by an attack text that is added to an image. In this paper, we …

Extract Free Dense Misalignment from CLIP

JY Nam, J Im, W Kim, T Kil - arXiv preprint arXiv:2412.18404, 2024 - arxiv.org
Recent vision-language foundation models still frequently produce outputs misaligned with
their inputs, evidenced by object hallucination in captioning and prompt misalignment in the …

Skip \n: A simple method to reduce hallucination in Large Vision-Language Models

Z Han, Z Bai, H Mei, Q Xu, C Zhang… - ICLR 2024 Workshop on …, 2024 - openreview.net
Recent advancements in large vision-language models (LVLMs) have demonstrated
impressive capability in visual information understanding with human language. Despite …