Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models

M Deitke, C Clark, S Lee, R Tripathi, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Today's most advanced multimodal models remain proprietary. The strongest open-weight
models rely heavily on synthetic data from proprietary VLMs to achieve good performance …

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

LVBench: An extreme long video understanding benchmark

W Wang, Z He, W Hong, Y Cheng, X Zhang, J Qi… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent progress in multimodal large language models has markedly enhanced the
understanding of short videos (typically under one minute), and several evaluation datasets …

Benchmark evaluations, applications, and challenges of large vision language models: A survey

Z Li, X Wu, H Du, H Nghiem, G Shi - arXiv preprint arXiv:2501.02189, 2025 - arxiv.org
Multimodal Vision Language Models (VLMs) have emerged as a transformative technology
at the intersection of computer vision and natural language processing, enabling machines …

HART: Efficient visual generation with hybrid autoregressive transformer

H Tang, Y Wu, S Yang, E Xie, J Chen, J Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual
generation model capable of directly generating 1024x1024 images, rivaling diffusion …

LLaVA-Critic: Learning to evaluate multimodal models

T Xiong, X Wang, D Guo, Q Ye, H Fan, Q Gu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as
a generalist evaluator to assess performance across a wide range of multimodal tasks …

Mora: Enabling generalist video generation via a multi-agent framework

Z Yuan, Y Liu, Y Cao, W Sun, H Jia, R Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-to-video generation has made significant strides, but replicating the capabilities of
advanced systems like OpenAI Sora remains challenging due to their closed-source nature …

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding

Z Wu, X Chen, Z Pan, X Liu, W Liu, D Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …

Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

Z Gao, Z Chen, E Cui, Y Ren, W Wang, J Zhu, H Tian… - Visual Intelligence, 2024 - Springer
Multi-modal large language models (MLLMs) have demonstrated impressive performance in
vision-language tasks across a wide range of domains. However, the large model scale and …