Towards audio language modeling – an overview

H Wu, X Chen, YC Lin, K Chang, HL Chung… - arXiv preprint arXiv …, 2024 - arxiv.org
Neural audio codecs were initially introduced to compress audio data into compact codes to
reduce transmission latency. Researchers recently discovered the potential of codecs as …
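
The compress-then-tokenize idea this abstract describes usually rests on residual vector quantization (RVQ), the discretization scheme most neural codecs use to turn encoder frames into compact code sequences. Below is a minimal NumPy sketch of RVQ under illustrative assumptions (toy codebook sizes, random codebooks, random frames standing in for encoder output); it is not any particular codec's implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def rvq_encode(frames, codebooks):
        """Quantize each frame with a stack of codebooks, each stage
        coding the residual left by the previous one."""
        residual = frames.copy()
        codes = []
        for cb in codebooks:                      # cb: (codebook_size, dim)
            # nearest codeword per frame (squared Euclidean distance)
            d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
            idx = d.argmin(axis=1)
            codes.append(idx)
            residual = residual - cb[idx]         # pass the residual onward
        return np.stack(codes, axis=1)            # (n_frames, n_stages)

    def rvq_decode(codes, codebooks):
        """Sum the selected codewords from every stage."""
        return sum(cb[codes[:, s]] for s, cb in enumerate(codebooks))

    dim, n_stages, cb_size = 8, 4, 16             # illustrative sizes
    codebooks = [rng.normal(size=(cb_size, dim)) for _ in range(n_stages)]
    frames = rng.normal(size=(10, dim))           # stand-in encoder output

    codes = rvq_encode(frames, codebooks)
    recon = rvq_decode(codes, codebooks)
    print(codes.shape, float(((frames - recon) ** 2).mean()))

Each added stage refines the reconstruction, which is why codecs can trade bitrate for quality simply by keeping fewer or more code streams.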

Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

GIVT: Generative Infinite-Vocabulary Transformers

M Tschannen, C Eastwood, F Mentzer - European Conference on …, 2024 - Springer
We introduce Generative Infinite-Vocabulary Transformers (GIVT), which generate
vector sequences with real-valued entries, instead of discrete tokens from a finite …
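
Generating "vector sequences with real-valued entries" instead of discrete tokens means replacing the softmax head with a continuous output distribution; GIVT parameterizes a Gaussian mixture. The sketch below shows that idea in miniature, with all projection matrices, widths, and the mixture size as illustrative stand-ins rather than the paper's actual architecture.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_out, n_mix = 32, 8, 4              # illustrative sizes

    # Stand-ins for learned projections from the decoder hidden state.
    W_pi = rng.normal(size=(d_model, n_mix)) * 0.1
    W_mu = rng.normal(size=(d_model, n_mix * d_out)) * 0.1
    W_sig = rng.normal(size=(d_model, n_mix * d_out)) * 0.1

    def sample_next_vector(h):
        """Sample one real-valued vector from the mixture defined by h."""
        logits = h @ W_pi
        pi = np.exp(logits - logits.max())
        pi /= pi.sum()                            # mixture weights
        mu = (h @ W_mu).reshape(n_mix, d_out)     # component means
        sigma = np.exp((h @ W_sig).reshape(n_mix, d_out))  # diagonal scales
        k = rng.choice(n_mix, p=pi)               # pick a component
        return mu[k] + sigma[k] * rng.normal(size=d_out)   # draw from it

    h = rng.normal(size=d_model)                  # stand-in decoder state
    print(sample_next_vector(h))

Because the output space is continuous, no finite vocabulary (and no VQ codebook) is needed between the model and the latent representation.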

SALM: Speech-augmented language model with in-context learning for speech recognition and translation

Z Chen, H Huang, A Andrusenko… - ICASSP 2024 …, 2024 - ieeexplore.ieee.org
We present a novel Speech Augmented Language Model (SALM) with multitask and in-context
learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a …
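
The "frozen text LLM plus audio encoder" recipe typically amounts to projecting audio features into the LLM's embedding space and prepending them to the text prompt, so only a small adapter is trained. A shape-level sketch under assumed dimensions follows; the projector and all sizes are hypothetical, not SALM's code.

    import numpy as np

    rng = np.random.default_rng(0)
    d_audio, d_llm = 512, 1024                    # illustrative widths

    audio_feats = rng.normal(size=(120, d_audio)) # frozen audio encoder output
    prompt_embs = rng.normal(size=(16, d_llm))    # frozen LLM token embeddings

    # The only trainable piece in this sketch: a linear modality adapter.
    W_proj = rng.normal(size=(d_audio, d_llm)) * 0.02

    audio_embs = audio_feats @ W_proj             # map audio into LLM space
    inputs = np.concatenate([audio_embs, prompt_embs], axis=0)

    # `inputs` would replace ordinary token embeddings at the frozen LLM's
    # input; gradients flow only into W_proj.
    print(inputs.shape)                           # (136, 1024)

Keeping the LLM frozen preserves its text abilities while the adapter learns to present audio in a form the model already understands.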

LLMs meet multimodal generation and editing: A survey

Y He, Z Liu, J Chen, Z Tian, H Liu, X Chi, R Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
With recent advancements in large language models (LLMs), there is growing interest in
combining LLMs with multimodal learning. Previous surveys of multimodal large language …

Codec-SUPERB: An in-depth analysis of sound codec models

H Wu, HL Chung, YC Lin, YK Wu, X Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The sound codec's dual roles of minimizing data transmission latency and serving as a
tokenizer underscore its critical importance. Recent years have witnessed significant …
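
Benchmarking codec reconstructions means scoring the decoded waveform against the original; one widely used signal-level number is scale-invariant SNR (SI-SNR), sketched below as a reference implementation. Whether this exact metric appears in Codec-SUPERB's suite is an assumption here; treat the sketch as one illustrative example.

    import numpy as np

    def si_snr(reference, estimate, eps=1e-8):
        """Scale-invariant signal-to-noise ratio in dB between waveforms."""
        reference = reference - reference.mean()
        estimate = estimate - estimate.mean()
        # Project the estimate onto the reference to cancel gain differences.
        scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
        target = scale * reference
        noise = estimate - target
        return 10 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))

    rng = np.random.default_rng(0)
    clean = rng.normal(size=16000)                  # 1 s of audio at 16 kHz
    decoded = clean + 0.1 * rng.normal(size=16000)  # stand-in reconstruction
    print(round(si_snr(clean, decoded), 2))         # roughly 20 dB here

The scale invariance matters because a codec that merely attenuates the signal should not be penalized for gain alone.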

VALL-T: Decoder-only generative transducer for robust and decoding-controllable text-to-speech

C Du, Y Guo, H Wang, Y Yang, Z Niu, S Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent TTS models with a decoder-only Transformer architecture, such as SPEAR-TTS and
VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot …

Autoregressive diffusion transformer for text-to-speech synthesis

Z Liu, S Wang, S Inoue, Q Bai, H Li - arXiv preprint arXiv:2406.05551, 2024 - arxiv.org
Audio language models have recently emerged as a promising approach for various audio
generation tasks, relying on audio tokenizers to encode waveforms into sequences of …
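
The premise that audio language models "rely on audio tokenizers to encode waveforms into sequences" implies generation is an ordinary autoregressive sampling loop over codec tokens. The toy loop below substitutes a random distribution for a trained Transformer; the vocabulary size, BOS/EOS ids, and length cap are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, eos_id, max_len = 1024, 0, 50     # illustrative codec vocab

    def next_token_probs(tokens):
        """Stand-in for a trained model's conditional distribution
        (the prefix `tokens` is ignored by this toy version)."""
        logits = rng.normal(size=vocab_size)
        p = np.exp(logits - logits.max())
        return p / p.sum()

    tokens = [1]                                  # assumed BOS code
    while len(tokens) < max_len:
        p = next_token_probs(tokens)
        t = rng.choice(vocab_size, p=p)           # sample next codec token
        if t == eos_id:
            break
        tokens.append(int(t))

    # `tokens` would be handed to the codec decoder to synthesize a waveform.
    print(len(tokens), tokens[:8])

This is exactly the loop the paper's diffusion-based alternative seeks to avoid running over a discrete vocabulary.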

Mini-Omni2: Towards open-source GPT-4o with vision, speech and duplex capabilities

Z **e, C Wu - arxiv preprint arxiv:2410.11190, 2024 - arxiv.org
GPT-4o, an all-encompassing model, represents a milestone in the development of large
multi-modal language models. It can understand visual, auditory, and textual modalities …

Boosting large language model for speech synthesis: An empirical study

H Hao, L Zhou, S Liu, J Li, S Hu, R Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have made significant advancements in natural language
processing and are concurrently extending their language ability to other modalities, such as …