A comprehensive review of multimodal large language models: Performance and challenges across different tasks

J Wang, H Jiang, Y Liu, C Ma, X Zhang, Y Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
In an era defined by the explosive growth of data and rapid technological advancements,
Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence …

AudioLM: a language modeling approach to audio generation

Z Borsos, R Marinier, D Vincent… - … ACM Transactions on …, 2023 - ieeexplore.ieee.org
We introduce AudioLM, a framework for high-quality audio generation with long-term
consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts …

AudioGPT: Understanding and generating speech, music, sound, and talking head

R Huang, M Li, D Yang, J Shi, X Chang, Z Ye… - Proceedings of the …, 2024 - ojs.aaai.org
Large language models (LLMs) have exhibited remarkable capabilities across a variety of
domains and tasks, challenging our understanding of learning and cognition. Despite the …

Make-A-Voice: Unified voice synthesis with discrete representation

R Huang, C Zhang, Y Wang, D Yang, L Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Various applications of voice synthesis have been developed independently despite the fact
that they generate "voice" as output in common. In addition, the majority of voice synthesis …

WavChat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

Are discrete units necessary for spoken language modeling?

TA Nguyen, B Sagot, E Dupoux - IEEE Journal of Selected …, 2022 - ieeexplore.ieee.org
Recent work in spoken language modeling shows the possibility of learning a language
unsupervisedly from raw audio without any text labels. The approach relies first on …

SpeechPrompt: Prompting speech language models for speech processing tasks

KW Chang, H Wu, YK Wang, YK Wu… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Prompting has become a practical method for utilizing pre-trained language models (LMs).
This approach offers several advantages. It allows an LM to adapt to new tasks with minimal …

Disentangling prosody representations with unsupervised speech reconstruction

L Qu, T Li, C Weber, T Pekarek-Rosin… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
Human speech can be characterized by different components, including semantic content,
speaker identity and prosodic information. Significant progress has been made in …

Paralinguistic privacy protection at the edge

R Aloufi, H Haddadi, D Boyle - ACM Transactions on Privacy and …, 2023 - dl.acm.org
Voice user interfaces and digital assistants are rapidly entering our lives and becoming
singular touch points spanning our devices. These always-on services capture and transmit …

Evolutionary Retrofitting

M Videau, M Zameshina, A Leite, L Najman… - arXiv preprint arXiv …, 2024 - arxiv.org
AfterLearnER (After Learning Evolutionary Retrofitting) consists in applying
non-differentiable optimization, including evolutionary methods, to refine fully-trained machine …