LLaVA-OneVision: Easy visual task transfer
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …
MiniCPM-V: A GPT-4V level MLLM on your phone
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally
reshaped the landscape of AI research and industry, shedding light on a promising path …
A survey of robot intelligence with large language models
Since the emergence of ChatGPT, research on large language models (LLMs) has actively
progressed across various fields. LLMs, pre-trained on vast text datasets, have exhibited …
Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models
Today's most advanced multimodal models remain proprietary. The strongest open-weight
models rely heavily on synthetic data from proprietary VLMs to achieve good performance …
Eagle: Exploring the design space for multimodal LLMs with mixture of encoders
The ability to accurately interpret complex visual information is a crucial topic of multimodal
large language models (MLLMs). Recent work indicates that enhanced visual perception …
Clinical insights: A comprehensive review of language models in medicine
This paper provides a detailed examination of the advancements and applications of large
language models in the healthcare sector, with a particular emphasis on clinical …
π0: A Vision-Language-Action Flow Model for General Robot Control
Robot learning holds tremendous promise to unlock the full potential of flexible, general, and
dexterous robot systems, as well as to address some of the deepest questions in artificial …
Benchmarking vision language models for cultural understanding
Foundation models and vision-language pre-training have notably advanced Vision
Language Models (VLMs), enabling multimodal processing of visual and linguistic data …
Robotic control via embodied chain-of-thought reasoning
A key limitation of learned robot control policies is their inability to generalize outside their
training data. Recent works on vision-language-action models (VLAs) have shown that the …
MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …