A comprehensive overview of large language models

H Naveed, AU Khan, S Qiu, M Saqib, S Anwar… - arxiv preprint arxiv …, 2023 - arxiv.org
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in
natural language processing tasks and beyond. This success of LLMs has led to a large …

[HTML][HTML] A survey of GPT-3 family large language models including ChatGPT and GPT-4

KS Kalyan - Natural Language Processing Journal, 2024 - Elsevier
Large language models (LLMs) are a special class of pretrained language models (PLMs)
obtained by scaling model size, pretraining corpus and computation. LLMs, because of their …

A survey on multimodal large language models

S Yin, C Fu, S Zhao, K Li, X Sun, T Xu… - arxiv preprint arxiv …, 2023 - arxiv.org
Multimodal Large Language Model (MLLM) recently has been a new rising research
hotspot, which uses powerful Large Language Models (LLMs) as a brain to perform …

A Survey of Multimodel Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

Pengi: An audio language model for audio tasks

S Deshmukh, B Elizalde, R Singh… - Advances in Neural …, 2023 - proceedings.neurips.cc
In the domain of audio processing, Transfer Learning has facilitated the rise of Self-
Supervised Learning and Zero-Shot Learning techniques. These approaches have led to …

Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset

S Chen, H Li, Q Wang, Z Zhao… - Advances in Neural …, 2024 - proceedings.neurips.cc
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …

Audioldm 2: Learning holistic audio generation with self-supervised pretraining

H Liu, Y Yuan, X Liu, X Mei, Q Kong… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Although audio generation shares commonalities across different types of audio, such as
speech, music, and sound effects, designing models for each type requires careful …

Uni-moe: Scaling unified multimodal llms with mixture of experts

Y Li, S Jiang, B Hu, L Wang, W Zhong… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Recent advancements in Multimodal Large Language Models (MLLMs) underscore the
significance of scalable models and data to boost performance, yet this often incurs …

Separate anything you describe

X Liu, Q Kong, Y Zhao, H Liu, Y Yuan… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Language-queried audio source separation (LASS) is a new paradigm for computational
auditory scene analysis (CASA). LASS aims to separate a target sound from an audio …

Cat: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios

Q Ye, Z Yu, R Shao, X **e, P Torr, X Cao - European Conference on …, 2024 - Springer
This paper focuses on the challenge of answering questions in scenarios that are composed
of rich and complex dynamic audio-visual components. Although existing Multimodal Large …