LLaVA-OneVision: Easy visual task transfer
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …
A comprehensive review of multimodal large language models: Performance and challenges across different tasks
In an era defined by the explosive growth of data and rapid technological advancements,
Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence …
Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …
MiniCPM-V: A GPT-4V level MLLM on your phone
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally
reshaped the landscape of AI research and industry, shedding light on a promising path …
LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models
Visual instruction tuning has made considerable strides in enhancing the capabilities of
Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single …
A survey of robot intelligence with large language models
Since the emergence of ChatGPT, research on large language models (LLMs) has actively
progressed across various fields. LLMs, pre-trained on vast text datasets, have exhibited …
CuMo: Scaling multimodal LLM with co-upcycled mixture-of-experts
Recent advancements in Multimodal Large Language Models (LLMs) have focused
primarily on scaling by increasing text-image pair data and enhancing LLMs to improve …
mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities
in executing instructions for a variety of single-image tasks. Despite this progress, significant …
Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution
Visual data comes in various forms, ranging from small icons of just a few pixels to long
videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual …
VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation
The recent years have witnessed great advances in video generation. However, the
development of automatic video metrics is lagging significantly behind. None of the existing …