SEED-Bench: Benchmarking Multimodal Large Language Models
Multimodal large language models (MLLMs) building upon the foundation of powerful large
language models (LLMs) have recently demonstrated exceptional capabilities in generating …
CapsFusion: Rethinking Image-Text Data at Scale
Large multimodal models demonstrate remarkable generalist ability to perform diverse
multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute …
SmartEdit: Exploring Complex Instruction-Based Image Editing with Multimodal Large Language Models
Current instruction-based image editing methods such as InstructPix2Pix often fail to
produce satisfactory results in complex scenarios due to their dependence on the simple …
Mini-Gemini: Mining the Potential of Multi-Modality Vision Language Models
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-
modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating …
The (R)Evolution of Multimodal Large Language Models: A Survey
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …
UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
Existing information retrieval (IR) models often assume a homogeneous format, limiting their
applicability to diverse user needs, such as searching for images with text descriptions …
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
The advances of large foundation models necessitate wide-coverage, low-cost, and zero-
contamination benchmarks. Despite continuous exploration of language model evaluations …
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …
What if the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-Modal Language Models
Counterfactual reasoning, a fundamental aspect of human cognition, involves contemplating
alternatives to established facts or past events, significantly enhancing our abilities in …
LLaMA Pro: Progressive LLaMA with Block Expansion
Humans generally acquire new skills without compromising the old; however, the opposite
holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA. To this end, we …