Explainable and interpretable multimodal large language models: A comprehensive survey

Y Dang, K Huang, J Huo, Y Yan, S Huang, D Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with
large language models (LLMs) and computer vision (CV) systems driving advancements in …

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning

H Zhang, M Gao, Z Gan, P Dufter, N Wenzel… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …

Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

Z Gao, Z Chen, E Cui, Y Ren, W Wang, J Zhu, H Tian… - Visual Intelligence, 2024 - Springer
Multi-modal large language models (MLLMs) have demonstrated impressive performance in
vision-language tasks across a wide range of domains. However, the large model scale and …

NVILA: Efficient frontier visual language models

Z Liu, L Zhu, B Shi, Z Zhang, Y Lou, S Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual language models (VLMs) have made significant advances in accuracy in recent
years. However, their efficiency has received much less attention. This paper introduces …

Apollo: An exploration of video understanding in large multimodal models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

Phantom of latent for large language and vision models

BK Lee, S Chung, CW Kim, B Park, YM Ro - arXiv preprint arXiv …, 2024 - arxiv.org
The success of visual instruction tuning has accelerated the development of large language
and vision models (LLVMs). Following the scaling laws of instruction-tuned large language …

Scaling inference-time search with vision value model for improved visual comprehension

W Xiyao, Y Zhengyuan, L Linjie, L Hongjin… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite significant advancements in vision-language models (VLMs), effective
approaches to enhance response quality by scaling inference-time computation are lacking. This …

Your mixture-of-experts llm is secretly an embedding model for free

Z Li, T Zhou - arXiv preprint arXiv:2410.10814, 2024 - arxiv.org
While large language models (LLMs) excel on generation tasks, their decoder-only
architecture often limits their potential as embedding models if no further representation …

Do language models understand time?

X Ding, L Wang - arXiv preprint arXiv:2412.13845, 2024 - arxiv.org
Large language models (LLMs) have revolutionized video-based computer vision
applications, including action recognition, anomaly detection, and video summarization …