Blink: Multimodal large language models can see but not perceive

X Fu, Y Hu, B Li, Y Feng, H Wang, X Lin, D Roth… - … on Computer Vision, 2024 - Springer
We introduce Blink, a new benchmark for multimodal large language models (LLMs) that focuses
on core visual perception abilities not found in other evaluations. Most of the Blink tasks can …

Brain-conditional multimodal synthesis: A survey and taxonomy

W Mai, J Zhang, P Fang, Z Zhang - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
In the era of Artificial Intelligence Generated Content (AIGC), conditional multimodal
synthesis technologies (e.g., text-to-image) are dynamically reshaping the natural content …

BRAVE: Broadening the visual encoding of vision-language models

OF Kar, A Tonioni, P Poklukar, A Kulshrestha… - … on Computer Vision, 2024 - Springer
Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …
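
As a point of reference for this standard composition, below is a minimal runnable sketch, not BRAVE's actual architecture: every module, dimension, and name is an assumption for illustration. Features from a stand-in vision encoder are projected into the language model's embedding space and prepended to the text tokens.

    import torch
    import torch.nn as nn

    class TinyVLM(nn.Module):
        """Toy VLM: vision encoder -> projector -> language model.
        All components and sizes are illustrative assumptions."""
        def __init__(self, vis_dim=768, lm_dim=512, vocab=32000):
            super().__init__()
            self.vision_encoder = nn.Linear(vis_dim, vis_dim)  # stand-in for a pretrained encoder such as CLIP
            self.projector = nn.Linear(vis_dim, lm_dim)        # maps visual features into the LM token space
            self.embed = nn.Embedding(vocab, lm_dim)
            layer = nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True)
            self.lm = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(lm_dim, vocab)

        def forward(self, patch_feats, text_ids):
            vis = self.projector(self.vision_encoder(patch_feats))  # (B, P, lm_dim)
            txt = self.embed(text_ids)                              # (B, T, lm_dim)
            seq = torch.cat([vis, txt], dim=1)  # visual tokens prefix the text
            return self.head(self.lm(seq))      # (B, P+T, vocab)

    vlm = TinyVLM()
    logits = vlm(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)))
    print(logits.shape)  # torch.Size([1, 24, 32000])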

VILA-U: A unified foundation model integrating visual understanding and generation

Y Wu, Z Zhang, J Chen, H Tang, D Li, Y Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding
and generation. Traditional visual language models (VLMs) use separate modules for …

Rotary position embedding for vision transformer

B Heo, S Park, D Han, S Yun - European Conference on Computer Vision, 2024 - Springer
Rotary Position Embedding (RoPE) performs remarkably on language models,
especially for length extrapolation of Transformers. However, the impacts of RoPE on …
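
The core RoPE operation is a position-dependent rotation of channel pairs in the query/key vectors. Below is a minimal sketch of the standard 1-D formulation (Su et al.); the paper's subject, 2-D variants suited to ViT patch grids, is not implemented here, and all shapes are illustrative assumptions.

    import torch

    def rope(x, base=10000.0):
        """Apply 1-D rotary position embedding to x of shape (seq_len, dim), dim even."""
        seq_len, dim = x.shape
        # One rotation frequency per channel pair, decaying geometrically.
        freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
        angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs  # (seq, dim/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[:, 0::2], x[:, 1::2]
        out = torch.empty_like(x)
        # Rotate each (x1, x2) pair by its position-dependent angle.
        out[:, 0::2] = x1 * cos - x2 * sin
        out[:, 1::2] = x1 * sin + x2 * cos
        return out

    q = torch.randn(16, 64)  # 16 tokens, 64 channels
    print(rope(q).shape)     # torch.Size([16, 64])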

The (R)Evolution of multimodal large language models: A survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining

D Liu, S Zhao, L Zhuo, W Lin, Y Qiao, H Li… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various
vision and language tasks, particularly excelling in generating flexible photorealistic images …

Multimodal pretraining, adaptation, and generation for recommendation: A survey

Q Liu, J Zhu, Y Yang, Q Dai, Z Du, XM Wu… - Proceedings of the 30th …, 2024 - dl.acm.org
Personalized recommendation serves as a ubiquitous channel for users to discover
information tailored to their interests. However, traditional recommendation models primarily …

WorldGPT: Empowering LLM as multimodal world model

Z Ge, H Huang, M Zhou, J Li, G Wang, S Tang… - Proceedings of the …, 2024 - dl.acm.org
World models are progressively being employed across diverse fields, extending from basic
environment simulation to complex scenario construction. However, existing models are …

MoMa: Efficient early-fusion pre-training with mixture of modality-aware experts

XV Lin, A Shrivastava, L Luo, S Iyer, M Lewis… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed
for pre-training mixed-modal, early-fusion language models. MoMa processes images and …
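
A rough illustration of the routing idea, a simplification rather than MoMa's implementation (expert counts, dimensions, and the top-1 routing below are all assumptions): each token is dispatched only to experts belonging to its own modality's group.

    import torch
    import torch.nn as nn

    class ModalityAwareMoE(nn.Module):
        """Toy modality-aware MoE layer: image tokens route among image experts,
        text tokens among text experts. Sizes are illustrative assumptions."""
        def __init__(self, dim=256, n_img=2, n_txt=2):
            super().__init__()
            self.img_experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_img))
            self.txt_experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_txt))
            self.img_router = nn.Linear(dim, n_img)
            self.txt_router = nn.Linear(dim, n_txt)

        @staticmethod
        def _route(x, router, experts):
            idx = router(x).argmax(dim=-1)  # top-1 expert per token
            out = torch.zeros_like(x)
            for e, expert in enumerate(experts):
                mask = idx == e
                out[mask] = expert(x[mask])
            return out

        def forward(self, tokens, is_image):
            # tokens: (N, dim); is_image: (N,) bool marking each token's modality.
            out = torch.zeros_like(tokens)
            out[is_image] = self._route(tokens[is_image], self.img_router, self.img_experts)
            out[~is_image] = self._route(tokens[~is_image], self.txt_router, self.txt_experts)
            return out

    moe = ModalityAwareMoE()
    toks, modality = torch.randn(10, 256), torch.tensor([True] * 4 + [False] * 6)
    print(moe(toks, modality).shape)  # torch.Size([10, 256])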