ShareGPT4V: Improving large multi-modal models with better captions

L Chen, J Li, X Dong, P Zhang, C He, J Wang… - … on Computer Vision, 2024 - Springer
Modality alignment serves as the cornerstone for large multi-modal models (LMMs).
However, the impact of different attributes (e.g., data type, quality, and scale) of training data …

LLaVA-OneVision: Easy visual task transfer

B Li, Y Zhang, D Guo, R Zhang, F Li, H Zhang… - arxiv preprint arxiv …, 2024 - arxiv.org
… Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a …

Vision language models are blind

P Rahmanzadehgervi, L Bolton… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models (LLMs) with vision capabilities (e.g., GPT-4o, Gemini 1.5, and Claude
3) are powering countless image-text processing applications, enabling unprecedented …

Math-LLaVA: Bootstrapping mathematical reasoning for multimodal large language models

W Shi, Z Hu, Y Bin, J Liu, Y Yang, SK Ng, L Bing… - arxiv preprint arxiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated impressive reasoning capabilities,
particularly in textual mathematical problem-solving. However, existing open-source image …

MME-Survey: A comprehensive survey on evaluation of multimodal LLMs

C Fu, YF Zhang, S Yin, B Li, X Fang, S Zhao… - arxiv preprint arxiv …, 2024 - arxiv.org
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …