Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …
What matters when building vision-language models?
H Laurençon, L Tronchon, M Cord… - Advances in Neural …, 2025 - proceedings.neurips.cc
The growing interest in vision-language models (VLMs) has been driven by improvements in
large language models and vision transformers. Despite the abundance of literature on this …
InternLM-XComposer-2.5: A versatile large vision-language model supporting long-contextual input and output
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …
Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …
NVLM: Open frontier-class multimodal LLMs
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs)
that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary …
MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …
VisualAgentBench: Towards large multimodal models as visual foundation agents
Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence,
merging capabilities in both language and vision to form highly capable Visual Foundation …
Automatically generating UI code from screenshot: A divide-and-conquer-based approach
Websites are critical in today's digital world, with over 1.11 billion currently active and
approximately 252,000 new sites launched daily. Converting website layout design into …
OmChat: A recipe to train multimodal language models with strong long context and video understanding
We introduce OmChat, a model designed to excel in handling long contexts and video
understanding tasks. OmChat's new architecture standardizes how different visual inputs are …
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks
Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they
demand the comprehension of high-level instructions, complex reasoning, and the …