Moshi: a speech-text foundation model for real-time dialogue
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue
framework. Current systems for spoken dialogue rely on pipelines of independent …
Recent advances in speech language models: A survey
Large Language Models (LLMs) have recently garnered significant attention, primarily for
their capabilities in text-based interactions. However, natural human interaction often relies …
Llama-omni: Seamless speech interaction with large language models
Models like GPT-4o enable real-time interaction with large language models (LLMs) through
speech, significantly enhancing user experience compared to traditional text-based …
Wavchat: A survey of spoken dialogue models
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …
Style-talker: Finetuning audio language model and style-based text-to-speech model for fast spoken dialogue generation
The rapid advancement of large language models (LLMs) has significantly propelled the
development of text-based chatbots, demonstrating their capability to engage in coherent …
Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models
Building upon advancements in Large Language Models (LLMs), the field of audio
processing has seen increased interest in training audio generation tasks with discrete …
DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset
Mainstream zero-shot TTS production systems like Voicebox and Seed-TTS achieve human
parity speech by leveraging Flow-matching and Diffusion models, respectively …
Body of Her: A Preliminary Study on End-to-End Humanoid Agent
T Ao - arXiv preprint arXiv:2408.02879, 2024 - arxiv.org
Interactive virtual humanoid agent is a crucial interface with the physical world. A relatively
complete humanoid agent first needs to have face and body, then possess both verbal and …
Neural Codec Source Tracing: Toward Comprehensive Attribution in Open-Set Condition
Current research in audio deepfake detection is gradually transitioning from binary
classification to multi-class tasks, referred to as the audio deepfake source tracing task. However …
PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
Multimodal language models that process both text and speech have a potential for
applications in spoken dialogue systems. However, current models face two major …