BlueLM-V-3B: Algorithm and system co-design for multimodal large language models on mobile devices

X Lu, Y Chen, C Chen, H Tan, B Chen, Y Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
The emergence and growing popularity of multimodal large language models (MLLMs) have
significant potential to enhance various aspects of daily life, from improving communication …

Skip \n: A simple method to reduce hallucination in large vision-language models

Z Han, Z Bai, H Mei, Q Xu, C Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in large vision-language models (LVLMs) have demonstrated
impressive capability in visual information understanding with human language. Despite …

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

AJ Wang, L Li, Y Lin, M Li, L Wang… - Advances in Neural …, 2025 - proceedings.neurips.cc
Training models with longer in-context lengths is a significant challenge for multimodal
machine learning due to substantial GPU memory and computational costs. This exploratory …

Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval

G Zeng, Y Zhang, J Wei, D Yang, P Zhang… - Proceedings of the …, 2024 - dl.acm.org
Scene text retrieval aims to find all images containing the query text from an image gallery.
Current efforts tend to adopt an Optical Character Recognition (OCR) pipeline, which …

DSDL: Data Set Description Language for Bridging Modalities and Tasks in AI Data

B Wang, L Ouyang, F Wu, W Ning, X Han… - arXiv preprint arXiv …, 2024 - arxiv.org
In the era of artificial intelligence, the diversity of data modalities and annotation formats
often renders data unusable directly, requiring understanding and format conversion before …

Type-R: Automatically Retouching Typos for Text-to-Image Generation

W Shimoda, N Inoue, D Haraguchi, H Mitani… - arXiv preprint arXiv …, 2024 - arxiv.org
While recent text-to-image models can generate photorealistic images from text prompts that
reflect detailed instructions, they still face significant challenges in accurately rendering …

Improving text generation on images with synthetic captions

J Koh, S Park, J Song - 2024 16th IIAI International Congress …, 2024 - ieeexplore.ieee.org
The recent emergence of latent diffusion models such as SDXL [1] and SD 1.5 [2] has shown
significant capability in generating highly detailed and realistic images. Despite their …

Typographic Attacks in a Multi-Image Setting

X Wang, Z Zhao, M Larson - arXiv preprint arXiv:2502.08193, 2025 - arxiv.org
Large Vision-Language Models (LVLMs) are susceptible to typographic attacks, which are
misclassifications caused by an attack text that is added to an image. In this paper, we …

Extract Free Dense Misalignment from CLIP

JY Nam, J Im, W Kim, T Kil - arXiv preprint arXiv:2412.18404, 2024 - arxiv.org
Recent vision-language foundation models still frequently produce outputs misaligned with
their inputs, evidenced by object hallucination in captioning and prompt misalignment in the …

Skip \n: A simple method to reduce hallucination in Large Vision-Language Models

Z Han, Z Bai, H Mei, Q Xu, C Zhang… - ICLR 2024 Workshop on …, 2024 - openreview.net
Recent advancements in large vision-language models (LVLMs) have demonstrated
impressive capability in visual information understanding with human language. Despite …