MM1: Methods, analysis and insights from multimodal LLM pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - … on Computer Vision, 2024 - Springer
In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …

LLM inference unveiled: Survey and roofline model insights

Z Yuan, Y Shang, Y Zhou, Z Dong, Z Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a
unique blend of opportunities and challenges. Although the field has expanded and is …

Are We on the Right Way for Evaluating Large Vision-Language Models?

L Chen, J Li, X Dong, P Zhang, Y Zang, Z Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking
numerous studies to evaluate their multi-modal capabilities. However, we dig into current …

MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning

H Zhang, M Gao, Z Gan, P Dufter, N Wenzel… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …

LLaVA-MoD: Making LLaVA tiny via MoE knowledge distillation

F Shu, Y Liao, L Zhuo, C Xu, L Zhang, G Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce LLaVA-MoD, a novel framework designed to enable the efficient training of
small-scale Multimodal Language Models (s-MLLM) by distilling knowledge from large-scale …

Rethinking visual prompting for multimodal large language models with external knowledge

Y Lin, Y Li, D Chen, W Xu, R Clark, P Torr… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, multimodal large language models (MLLMs) have made significant strides
by training on vast high-quality image-text datasets, enabling them to generally understand …

DOPRA: Decoding over-accumulation penalization and re-allocation in specific weighting layer

J Wei, X Zhang - Proceedings of the 32nd ACM International …, 2024 - dl.acm.org
In this work, we introduce DOPRA, a novel approach designed to mitigate hallucinations in
multi-modal large language models (MLLMs). Unlike existing solutions that typically involve …

ShareGPT4Video: Improving video understanding and generation with better captions

L Chen, X Wei, J Li, X Dong, P Zhang, Y Zang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large
video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) …

TinyChart: Efficient chart understanding with visual token merging and program-of-thoughts learning

L Zhang, A Hu, H Xu, M Yan, Y Xu, Q Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
Charts are important for presenting and explaining complex data relationships. Recently,
multimodal large language models (MLLMs) have shown remarkable capabilities in various …

Automated multi-level preference for MLLMs

M Zhang, W Wu, Y Lu, Y Song, K Rong, H Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
Current multimodal Large Language Models (MLLMs) suffer from "hallucination",
occasionally generating responses that are not grounded in the input images. To tackle this …