SEED-Bench: Benchmarking Multimodal Large Language Models

B Li, Y Ge, Y Ge, G Wang, R Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating …

CapsFusion: Rethinking image-text data at scale

Q Yu, Q Sun, X Zhang, Y Cui, F Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large multimodal models demonstrate remarkable generalist ability to perform diverse
multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute …

SmartEdit: Exploring complex instruction-based image editing with multimodal large language models

Y Huang, L **e, X Wang, Z Yuan… - Proceedings of the …, 2024 - openaccess.thecvf.com
Current instruction-based image editing methods, such as InstructPix2Pix, often fail to produce satisfactory results in complex scenarios due to their dependence on the simple …

Mini-Gemini: Mining the potential of multi-modality vision language models

Y Li, Y Zhang, C Wang, Z Zhong, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating …

The (r)evolution of multimodal large language models: A survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

UniIR: Training and benchmarking universal multimodal information retrievers

C Wei, Y Chen, H Chen, H Hu, G Zhang, J Fu… - … on Computer Vision, 2024 - Springer
Existing information retrieval (IR) models often assume a homogeneous format, limiting their
applicability to diverse user needs, such as searching for images with text descriptions …

LMMs-Eval: Reality check on the evaluation of large multimodal models

K Zhang, B Li, P Zhang, F Pu, JA Cahyono… - arXiv preprint arXiv …, 2024 - arxiv.org
The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations …

Janus: Decoupling visual encoding for unified multimodal understanding and generation

C Wu, X Chen, Z Wu, Y Ma, X Liu, Z Pan, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …

What if the TV was off? Examining counterfactual reasoning abilities of multi-modal language models

L Zhang, X Zhai, Z Zhao, Y Zong… - Proceedings of the …, 2024 - openaccess.thecvf.com
Counterfactual reasoning, a fundamental aspect of human cognition, involves contemplating alternatives to established facts or past events, significantly enhancing our abilities in …

LLaMA Pro: Progressive LLaMA with block expansion

C Wu, Y Gan, Y Ge, Z Lu, J Wang, Y Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA. To this end, we …