SEED-Bench: Benchmarking Multimodal Large Language Models

B Li, Y Ge, Y Ge, G Wang, R Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating …

CapsFusion: Rethinking image-text data at scale

Q Yu, Q Sun, X Zhang, Y Cui, F Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large multimodal models demonstrate remarkable generalist ability to perform diverse
multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute …

SmartEdit: Exploring complex instruction-based image editing with multimodal large language models

Y Huang, L **e, X Wang, Z Yuan… - Proceedings of the …, 2024 - openaccess.thecvf.com
Current instruction-based image editing methods, such as InstructPix2Pix, often fail to produce satisfactory results in complex scenarios due to their dependence on the simple …

Mini-Gemini: Mining the potential of multi-modality vision language models

Y Li, Y Zhang, C Wang, Z Zhong, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating …

The (r)evolution of multimodal large language models: A survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

UniIR: Training and benchmarking universal multimodal information retrievers

C Wei, Y Chen, H Chen, H Hu, G Zhang, J Fu… - … on Computer Vision, 2024 - Springer
Existing information retrieval (IR) models often assume a homogeneous format, limiting their
applicability to diverse user needs, such as searching for images with text descriptions …

LMMs-Eval: Reality check on the evaluation of large multimodal models

K Zhang, B Li, P Zhang, F Pu, JA Cahyono… - arXiv preprint arXiv …, 2024 - arxiv.org
The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations …

Janus: Decoupling visual encoding for unified multimodal understanding and generation

C Wu, X Chen, Z Wu, Y Ma, X Liu, Z Pan, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …

What if the TV was off? Examining counterfactual reasoning abilities of multi-modal language models

L Zhang, X Zhai, Z Zhao, Y Zong… - Proceedings of the …, 2024 - openaccess.thecvf.com
Counterfactual reasoning, a fundamental aspect of human cognition, involves contemplating alternatives to established facts or past events, significantly enhancing our abilities in …

LLaMA Pro: Progressive LLaMA with block expansion

C Wu, Y Gan, Y Ge, Z Lu, J Wang, Y Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA. To this end, we …