Advancing large language models to capture varied speaking styles and respond properly in spoken conversations

GT Lin, CH Chiang, H Lee - arXiv preprint arXiv:2402.12786, 2024 - arxiv.org
In spoken dialogue, even if two current turns are the same sentence, their responses might
still differ when they are spoken in different styles. The spoken styles, containing …

WavChat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

Generative expressive conversational speech synthesis

R Liu, Y Hu, Y Ren, X Yin, H Li - Proceedings of the 32nd ACM …, 2024 - dl.acm.org
Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper
speaking style in a user-agent conversation setting. Existing CSS methods employ effective …

Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback

GT Lin, PG Shivakumar, A Gourav, Y Gu… - arXiv preprint arXiv …, 2024 - arxiv.org
While textless Spoken Language Models (SLMs) have shown potential in end-to-end
speech-to-speech modeling, they still lag behind text-based Large Language Models …

Style-Talker: Finetuning audio language model and style-based text-to-speech model for fast spoken dialogue generation

YA Li, X Jiang, J Darefsky, G Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of large language models (LLMs) has significantly propelled the
development of text-based chatbots, demonstrating their capability to engage in coherent …

MinMo: A multimodal large language model for seamless voice interaction

Q Chen, Y Chen, Y Chen, M Chen, Y Chen… - arXiv preprint arXiv …, 2025 - arxiv.org
Recent advancements in large language models (LLMs) and multimodal speech-text
models have laid the groundwork for seamless voice interactions, enabling real-time …

Universal Speech Token Learning Via Low-Bitrate Neural Codec and Pretrained Representations

X Jiang, X Peng, Y Zhang, Y Lu - IEEE Journal of Selected …, 2024 - ieeexplore.ieee.org
Current large speech language models are mainly based on semantic tokens from
discretization of self-supervised learned representations and acoustic tokens from a neural …

E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models

H Xue, Y Liang, B Mu, S Zhang, M Chen… - 2024 IEEE 14th …, 2024 - ieeexplore.ieee.org
This study focuses on emotion-sensitive spoken dialogue in human-machine speech
interaction. With the advancement of Large Language Models (LLMs), dialogue systems can …

Can LLMs Understand the Implication of Emphasized Sentences in Dialogue?

GT Lin, H Lee - arXiv preprint arXiv:2406.11065, 2024 - arxiv.org
Emphasis is a crucial component in human communication, which indicates the speaker's
intention and implication beyond pure text in dialogue. While Large Language Models …

Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech

W Kang, J Jia, C Wu, W Zhou, E Lakomkin… - arXiv preprint arXiv …, 2024 - arxiv.org
As speech becomes an increasingly common modality for interacting with large language
models (LLMs), it is becoming desirable to develop systems where LLMs can take into …