MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

From image to language: A critical analysis of visual question answering (VQA) approaches, challenges, and opportunities

MF Ishmam, MSH Shovon, MF Mridha, N Dey - Information Fusion, 2024 - Elsevier
The multimodal task of Visual Question Answering (VQA), encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …

Qwen technical report

J Bai, S Bai, Y Chu, Z Cui, K Dang, X Deng… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have revolutionized the field of artificial intelligence,
enabling natural language processing tasks that were previously thought to be exclusive to …

The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention

R Zhang, J Han, C Liu, P Gao, A Zhou, X Hu… - arXiv preprint arXiv …, 2023 - arxiv.org
We present LLaMA-Adapter, a lightweight adaptation method to efficiently fine-tune LLaMA
into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter …

MVBench: A comprehensive multi-modal video understanding benchmark

K Li, Y Wang, Y He, Y Li, Y Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of
diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …

Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

P Tong, E Brown, P Wu, S Woo… - Advances in …, 2025 - proceedings.neurips.cc
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a
vision-centric approach. While stronger language models can enhance multimodal capabilities, the …

What matters when building vision-language models?

H Laurençon, L Tronchon, M Cord… - Advances in Neural …, 2025 - proceedings.neurips.cc
The growing interest in vision-language models (VLMs) has been driven by improvements in
large language models and vision transformers. Despite the abundance of literature on this …

CogAgent: A visual language model for GUI agents

W Hong, W Wang, Q Lv, J Xu, W Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
People are spending an enormous amount of time on digital devices through graphical user
interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such …

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …