FunAudioLLM: Voice understanding and generation foundation models for natural interaction between humans and LLMs

K An, Q Chen, C Deng, Z Du, C Gao, Z Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
This report introduces FunAudioLLM, a model family designed to enhance natural voice
interactions between humans and large language models (LLMs). At its core are two …

WavChat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey

T Xie, Y Rong, P Zhang, L Liu - arXiv preprint arXiv:2412.06602, 2024 - arxiv.org
Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that
aims to generate natural-sounding human speech from text. Recently, with the increasing …

ACE: A generative cross-modal retrieval framework with coarse-to-fine semantic modeling

M Fang, S Ji, J Zuo, H Huang, Y Xia, J Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative retrieval, which has demonstrated effectiveness in text-to-text retrieval, utilizes a
sequence-to-sequence model to directly generate candidate identifiers based on natural …

MinMo: A multimodal large language model for seamless voice interaction

Q Chen, Y Chen, Y Chen, M Chen, Y Chen… - arXiv preprint arXiv …, 2025 - arxiv.org
Recent advancements in large language models (LLMs) and multimodal speech-text
models have laid the groundwork for seamless voice interactions, enabling real-time …

Speech Watermarking with Discrete Intermediate Representations

S Ji, Z Jiang, J Zuo, M Fang, Y Chen, T Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
Speech watermarking techniques can proactively mitigate the potential harmful
consequences of instant voice cloning techniques. These techniques involve the insertion of …

Semantic Residual for Multimodal Unified Discrete Representation

H Huang, S Wang, Y Xia - arXiv preprint arXiv:2412.19128, 2024 - arxiv.org
Recent research in the domain of multimodal unified representations predominantly
employs codebook as representation forms, utilizing Vector Quantization (VQ) for …