Mm-llms: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arxiv preprint arxiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Video-llava: Learning united visual representation by alignment before projection

B Lin, Y Ye, B Zhu, J Cui, M Ning, P **… - arxiv preprint arxiv …, 2023 - arxiv.org
The Large Vision-Language Model (LVLM) has enhanced the performance of various
downstream tasks in visual-language understanding. Most existing approaches encode …

Llamafactory: Unified efficient fine-tuning of 100+ language models

Y Zheng, R Zhang, J Zhang, Y Ye, Z Luo… - arxiv preprint arxiv …, 2024 - arxiv.org
Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks.
However, it requires non-trivial efforts to implement these methods on different models. We …

Internvideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Z Chen, J Wu, W Wang, W Su, G Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However the progress in vision and vision …

Chat-univi: Unified visual representation empowers large language models with image and video understanding

P **, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …

Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting

Y Wang, X Liu, Y Li, M Chen, C **ao - European Conference on Computer …, 2024 - Springer
With the advent and widespread deployment of Multimodal Large Language Models
(MLLMs), the imperative to ensure their safety has become increasingly pronounced …

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arxiv preprint arxiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Worldgpt: Empowering llm as multimodal world model

Z Ge, H Huang, M Zhou, J Li, G Wang, S Tang… - Proceedings of the …, 2024 - dl.acm.org
World models are progressively being employed across diverse fields, extending from basic
environment simulation to complex scenario construction. However, existing models are …

When large language model agents meet 6G networks: Perception, grounding, and alignment

M Xu, D Niyato, J Kang, Z **ong, S Mao… - IEEE Wireless …, 2024 - ieeexplore.ieee.org
AI agents based on multimodal large language models (LLMs) are expected to revolutionize
human-computer interaction, and offer more personalized assistant services across various …