MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Large language models and causal inference in collaboration: A comprehensive survey

X Liu, P Xu, J Wu, J Yuan, Y Yang, Y Zhou, F Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Causal inference has shown potential in enhancing the predictive accuracy, fairness,
robustness, and explainability of Natural Language Processing (NLP) models by capturing …

MMBench: Is your multi-modal model an all-around player?

Y Liu, H Duan, Y Zhang, B Li, S Zhang, W Zhao… - European conference on …, 2024 - Springer
Large vision-language models (VLMs) have recently achieved remarkable progress,
exhibiting impressive multimodal perception and reasoning abilities. However, effectively …

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

MM1: Methods, analysis and insights from multimodal LLM pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - … on Computer Vision, 2024 - Springer
In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …

CogVLM: Visual expert for pretrained language models

W Wang, Q Lv, W Yu, W Hong, J Qi, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce CogVLM, a powerful open-source visual language foundation model. Unlike
the popular shallow alignment method, which maps image features into the input space …

LLaVA-UHD: An LMM perceiving any aspect ratio and high-resolution images

Z Guo, R Xu, Y Yao, J Cui, Z Ni, C Ge, TS Chua… - … on Computer Vision, 2024 - Springer
Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding
the visual world. Conventional LMMs process images at fixed sizes and limited resolutions …

Woodpecker: Hallucination correction for multimodal large language models

S Yin, C Fu, S Zhao, T Xu, H Wang, D Sui… - Science China …, 2024 - Springer
Hallucination is a big shadow hanging over the rapidly evolving multimodal large language
models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content …

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …