- Academic Search

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arxiv preprint arxiv …, 2024 - arxiv.org

In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Salva Cita Citato da 214 Articoli correlati Tutte e 2 le versioni Versione HTML

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Video-llava: Learning united visual representation by alignment before projection

B Lin, Y Ye, B Zhu, J Cui, M Ning, P **… - arxiv preprint arxiv …, 2023 - arxiv.org

The Large Vision-Language Model (LVLM) has enhanced the performance of various
downstream tasks in visual-language understanding. Most existing approaches encode …

Salva Cita Citato da 439 Articoli correlati Tutte e 3 le versioni Versione HTML

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Llamafactory: Unified efficient fine-tuning of 100+ language models

Y Zheng, R Zhang, J Zhang, Y Ye, Z Luo… - arxiv preprint arxiv …, 2024 - arxiv.org

Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks.
However, it requires non-trivial efforts to implement these methods on different models. We …

Salva Cita Citato da 270 Articoli correlati Tutte e 2 le versioni Versione HTML

Internvideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Salva Cita Citato da 119 Articoli correlati Tutte e 3 le versioni

[Free GPT-4]
[DeepSeek]

[PDF] thecvf.com

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Z Chen, J Wu, W Wang, W Su, G Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com

The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However the progress in vision and vision …

Salva Cita Citato da 177 Articoli correlati Tutte e 4 le versioni Versione HTML

[Free GPT-4]
[DeepSeek]

[PDF] thecvf.com

Chat-univi: Unified visual representation empowers large language models with image and video understanding

P **, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …

Salva Cita Citato da 170 Articoli correlati Tutte e 4 le versioni Versione HTML

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting

Y Wang, X Liu, Y Li, M Chen, C **ao - European Conference on Computer …, 2024 - Springer

With the advent and widespread deployment of Multimodal Large Language Models
(MLLMs), the imperative to ensure their safety has become increasingly pronounced …

Salva Cita Citato da 35 Articoli correlati Tutte e 2 le versioni

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arxiv preprint arxiv …, 2023 - arxiv.org

With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Salva Cita Citato da 61 Articoli correlati Tutte e 2 le versioni Versione HTML

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Worldgpt: Empowering llm as multimodal world model

Z Ge, H Huang, M Zhou, J Li, G Wang, S Tang… - Proceedings of the …, 2024 - dl.acm.org

World models are progressively being employed across diverse fields, extending from basic
environment simulation to complex scenario construction. However, existing models are …

Salva Cita Citato da 20 Articoli correlati Tutte e 2 le versioni

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

When large language model agents meet 6G networks: Perception, grounding, and alignment

M Xu, D Niyato, J Kang, Z **ong, S Mao… - IEEE Wireless …, 2024 - ieeexplore.ieee.org

AI agents based on multimodal large language models (LLMs) are expected to revolutionize
human-computer interaction, and offer more personalized assistant services across various …

Salva Cita Citato da 26 Articoli correlati Tutte e 2 le versioni

Crea avviso

Cita

Ricerca avanzata

Salvato in La mia biblioteca

Languagebind: Extending video-language pretraining to n-modality by language-based semantic...

Mm-llms: Recent advances in multimodal large language models

Video-llava: Learning united visual representation by alignment before projection

Llamafactory: Unified efficient fine-tuning of 100+ language models

Internvideo2: Scaling foundation models for multimodal video understanding

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Chat-univi: Unified visual representation empowers large language models with image and video understanding

Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting

Video understanding with large language models: A survey

Worldgpt: Empowering llm as multimodal world model

When large language model agents meet 6G networks: Perception, grounding, and alignment