Blink: Multimodal large language models can see but not perceive
We introduce Blink, a new benchmark for multimodal large language models (LLMs) that focuses
on core visual perception abilities not found in other evaluations. Most of the Blink tasks can …
Brain-conditional multimodal synthesis: A survey and taxonomy
W Mai, J Zhang, P Fang, Z Zhang - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
In the era of Artificial Intelligence Generated Content (AIGC), conditional multimodal
synthesis technologies (e.g., text-to-image) are dynamically reshaping the natural content …
BRAVE: Broadening the visual encoding of vision-language models
Vision-language models (VLMs) are typically composed of a vision encoder, e.g., CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …
VILA-U: a unified foundation model integrating visual understanding and generation
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding
and generation. Traditional visual language models (VLMs) use separate modules for …
Rotary position embedding for vision transformer
Abstract Rotary Position Embedding (RoPE) performs remarkably on language models,
especially for length extrapolation of Transformers. However, the impacts of RoPE on …
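The core idea behind RoPE, as applied in the paper above, can be sketched briefly. Each pair of embedding dimensions is rotated by a position-dependent angle, so that the dot product between a rotated query and key depends only on their relative position. This is a minimal NumPy sketch (the function name and the GPT-NeoX-style half-split pairing are illustrative choices, not taken from the paper):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply Rotary Position Embedding to a vector x at position `pos`.

    Dimension pairs (i, i + d/2) are rotated by angle pos * base^(-2i/d),
    so lower dimensions rotate faster than higher ones.
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation applied independently to each (x1[i], x2[i]) pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Relative-position property: shifting both positions by the same
# offset leaves the query-key dot product unchanged.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
s1 = rope(q, 3) @ rope(k, 1)
s2 = rope(q, 7) @ rope(k, 5)
```

The final lines check the property that makes RoPE attractive for length extrapolation: attention scores depend on relative, not absolute, positions.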
The (r) evolution of multimodal large language models: A survey
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …
Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various
vision and language tasks, particularly excelling in generating flexible photorealistic images …
Multimodal pretraining, adaptation, and generation for recommendation: A survey
Personalized recommendation serves as a ubiquitous channel for users to discover
information tailored to their interests. However, traditional recommendation models primarily …
WorldGPT: Empowering LLM as multimodal world model
World models are progressively being employed across diverse fields, extending from basic
environment simulation to complex scenario construction. However, existing models are …
Moma: Efficient early-fusion pre-training with mixture of modality-aware experts
We introduce MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed
for pre-training mixed-modal, early-fusion language models. MoMa processes images and …
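The modality-aware routing described in the MoMa abstract can be sketched as follows: tokens are first dispatched to a modality-specific expert group, and only then routed to an individual expert within that group. This is a toy illustration under assumed shapes; the expert sizes, the random hard routing (standing in for a learned router), and all function names are hypothetical, not MoMa's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_expert(d_model, d_hidden):
    """One expert: a tiny two-layer ReLU feed-forward net with random weights."""
    w1 = rng.normal(size=(d_model, d_hidden)) * 0.02
    w2 = rng.normal(size=(d_hidden, d_model)) * 0.02
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

D = 16
# Separate expert pools per modality -- the "modality-aware" part.
experts = {
    "text":  [make_expert(D, 32) for _ in range(2)],
    "image": [make_expert(D, 32) for _ in range(2)],
}

def moma_layer(tokens, modalities):
    """Route each token to its modality's expert group, then to one
    expert in that group (random hard routing as a router stand-in)."""
    out = np.empty_like(tokens)
    for i, (tok, mod) in enumerate(zip(tokens, modalities)):
        group = experts[mod]
        out[i] = group[rng.integers(len(group))](tok)
    return out

tokens = rng.normal(size=(5, D))
mods = ["text", "image", "text", "image", "text"]
y = moma_layer(tokens, mods)
```

The design point is that routing by modality first keeps image and text tokens from competing for the same expert capacity in an early-fusion, mixed-modal sequence.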