MM1: Methods, analysis and insights from multimodal LLM pre-training
In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …
LLM inference unveiled: Survey and roofline model insights
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a
unique blend of opportunities and challenges. Although the field has expanded and is …
Are We on the Right Way for Evaluating Large Vision-Language Models?
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking
numerous studies to evaluate their multi-modal capabilities. However, we dig into current …
MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning
We present MM1.5, a new family of multimodal large language models (MLLMs) designed
to enhance capabilities in text-rich image understanding, visual referring and grounding …
LLaVA-MoD: Making LLaVA tiny via MoE knowledge distillation
We introduce LLaVA-MoD, a novel framework designed to enable the efficient training of
small-scale Multimodal Language Models (s-MLLM) by distilling knowledge from large-scale …
Rethinking visual prompting for multimodal large language models with external knowledge
In recent years, multimodal large language models (MLLMs) have made significant strides
by training on vast high-quality image-text datasets, enabling them to generally understand …
DOPRA: Decoding over-accumulation penalization and re-allocation in specific weighting layer
In this work, we introduce DOPRA, a novel approach designed to mitigate hallucinations in
multi-modal large language models (MLLMs). Unlike existing solutions that typically involve …
ShareGPT4Video: Improving video understanding and generation with better captions
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large
video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) …
TinyChart: Efficient chart understanding with visual token merging and program-of-thoughts learning
Charts are important for presenting and explaining complex data relationships. Recently,
multimodal large language models (MLLMs) have shown remarkable capabilities in various …
Automated multi-level preference for MLLMs
Current multimodal Large Language Models (MLLMs) suffer from "hallucination",
occasionally generating responses that are not grounded in the input images. To tackle this …