LLaVA-OneVision: Easy visual task transfer

B Li, Y Zhang, D Guo, R Zhang, F Li, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …

A comprehensive review of multimodal large language models: Performance and challenges across different tasks

J Wang, H Jiang, Y Liu, C Ma, X Zhang, Y Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
In an era defined by the explosive growth of data and rapid technological advancements,
Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence …

Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

MiniCPM-V: A GPT-4V level MLLM on your phone

Y Yao, T Yu, A Zhang, C Wang, J Cui, H Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally
reshaped the landscape of AI research and industry, shedding light on a promising path …

LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models

F Li, R Zhang, H Zhang, Y Zhang, B Li, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual instruction tuning has made considerable strides in enhancing the capabilities of
Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single …

A survey of robot intelligence with large language models

H Jeong, H Lee, C Kim, S Shin - Applied Sciences, 2024 - mdpi.com
Since the emergence of ChatGPT, research on large language models (LLMs) has actively
progressed across various fields. LLMs, pre-trained on vast text datasets, have exhibited …

CuMo: Scaling multimodal LLM with co-upcycled mixture-of-experts

J Li, X Wang, S Zhu, CW Kuo, L Xu… - Advances in …, 2025 - proceedings.neurips.cc
Recent advancements in Multimodal Large Language Models (MLLMs) have focused
primarily on scaling by increasing text-image pair data and enhancing LLMs to improve …

mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models

J Ye, H Xu, H Liu, A Hu, M Yan, Q Qian, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities
in executing instructions for a variety of single-image tasks. Despite this progress, significant …

Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution

Z Liu, Y Dong, Z Liu, W Hu, J Lu, Y Rao - arXiv preprint arXiv:2409.12961, 2024 - arxiv.org
Visual data comes in various forms, ranging from small icons of just a few pixels to long
videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual …

VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation

X He, D Jiang, G Zhang, M Ku, A Soni, S Siu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent years have witnessed great advances in video generation. However, the
development of automatic video metrics is lagging significantly behind. None of the existing …