LLaVA-OneVision: Easy visual task transfer

B Li, Y Zhang, D Guo, R Zhang, F Li, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …

MiniCPM-V: A GPT-4V level MLLM on your phone

Y Yao, T Yu, A Zhang, C Wang, J Cui, H Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally
reshaped the landscape of AI research and industry, shedding light on a promising path …

A survey of robot intelligence with large language models

H Jeong, H Lee, C Kim, S Shin - Applied Sciences, 2024 - mdpi.com
Since the emergence of ChatGPT, research on large language models (LLMs) has actively
progressed across various fields. LLMs, pre-trained on vast text datasets, have exhibited …

Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models

M Deitke, C Clark, S Lee, R Tripathi, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Today's most advanced multimodal models remain proprietary. The strongest open-weight
models rely heavily on synthetic data from proprietary VLMs to achieve good performance …

Eagle: Exploring the design space for multimodal LLMs with mixture of encoders

M Shi, F Liu, S Wang, S Liao, S Radhakrishnan… - arXiv preprint arXiv …, 2024 - arxiv.org
The ability to accurately interpret complex visual information is a crucial topic of multimodal
large language models (MLLMs). Recent work indicates that enhanced visual perception …

Clinical insights: A comprehensive review of language models in medicine

N Neveditsin, P Lingras, V Mago - arXiv preprint arXiv:2408.11735, 2024 - arxiv.org
This paper provides a detailed examination of the advancements and applications of large
language models in the healthcare sector, with a particular emphasis on clinical …

π0: A Vision-Language-Action Flow Model for General Robot Control

K Black, N Brown, D Driess, A Esmail, M Equi… - arXiv preprint arXiv …, 2024 - arxiv.org
Robot learning holds tremendous promise to unlock the full potential of flexible, general, and
dexterous robot systems, as well as to address some of the deepest questions in artificial …

Benchmarking vision language models for cultural understanding

S Nayak, K Jain, R Awal, S Reddy… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation models and vision-language pre-training have notably advanced Vision
Language Models (VLMs), enabling multimodal processing of visual and linguistic data …

Robotic control via embodied chain-of-thought reasoning

M Zawalski, W Chen, K Pertsch, O Mees, C Finn… - arXiv preprint arXiv …, 2024 - arxiv.org
A key limitation of learned robot control policies is their inability to generalize outside their
training data. Recent works on vision-language-action models (VLAs) have shown that the …

MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning

H Zhang, M Gao, Z Gan, P Dufter, N Wenzel… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …