A Survey of Multimodal Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture across modalities,
including vision, large language model technology is evolving from a single modality …

A comprehensive review of multimodal large language models: Performance and challenges across different tasks

J Wang, H Jiang, Y Liu, C Ma, X Zhang, Y Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
In an era defined by the explosive growth of data and rapid technological advancements,
Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence …

Mini-Gemini: Mining the potential of multi-modality vision language models

Y Li, Y Zhang, C Wang, Z Zhong, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-
modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating …

VITA: Towards open-source interactive omni multimodal LLM

C Fu, H Lin, Z Long, Y Shen, M Zhao, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The remarkable multimodal capabilities and interactive experience of GPT-4o underscore
their necessity in practical applications, yet open-source models rarely excel in both areas …

Multimodal pretraining, adaptation, and generation for recommendation: A survey

Q Liu, J Zhu, Y Yang, Q Dai, Z Du, XM Wu… - Proceedings of the 30th …, 2024 - dl.acm.org
Personalized recommendation serves as a ubiquitous channel for users to discover
information tailored to their interests. However, traditional recommendation models primarily …

WavTokenizer: An efficient acoustic discrete codec tokenizer for audio language modeling

S Ji, Z Jiang, W Wang, Y Chen, M Fang, J Zuo… - arXiv preprint arXiv …, 2024 - arxiv.org
Language models have been effectively applied to modeling natural signals, such as
images, video, speech, and audio. A crucial component of these models is the codec …

EMOVA: Empowering language models to see, hear and speak with vivid emotions

K Chen, Y Gou, R Huang, Z Liu, D Tan, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …

Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

Multimodal Self-Instruct: Synthetic abstract image and visual reasoning instruction using language model

W Zhang, Z Cheng, Y He, M Wang, Y Shen… - arXiv preprint arXiv …, 2024 - arxiv.org
Although most current large multimodal models (LMMs) can already understand photos of
natural scenes and portraits, their understanding of abstract images, e.g., charts, maps, or …

AdaNAT: Exploring adaptive policy for token-based image generation

Z Ni, Y Wang, R Zhou, R Lu, J Guo, J Hu, Z Liu… - … on Computer Vision, 2024 - Springer
Recent studies have demonstrated the effectiveness of token-based methods for visual
content generation. As a representative work, non-autoregressive Transformers (NATs) are …