MoVA: Adapting mixture of vision experts to multimodal context

Z Zong, B Ma, D Shen, G Song, H Shao, D Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
As the key component in multimodal large language models (MLLMs), the ability of the
visual encoder greatly affects MLLM's understanding of diverse image content. Although …