The rise and potential of large language model based agents: A survey
For a long time, researchers have sought artificial intelligence (AI) that matches or exceeds
human intelligence. AI agents, which are artificial entities capable of sensing the …
A review of deep learning techniques for speech processing
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …
NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …
VioLA: Conditional language models for speech recognition, synthesis, and translation
Recent research shows a big convergence in model architecture, training objectives, and
inference methods across various tasks for different modalities. In this paper, we propose …
UniAudio: An audio foundation model toward universal audio generation
Large Language models (LLM) have demonstrated the capability to handle a variety of
generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific …
VoiceCraft: Zero-shot speech editing and text-to-speech in the wild
We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-
of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on …
SpeechX: Neural codec language model as a versatile speech transformer
Recent advancements in generative speech models based on audio-text prompts have
enabled remarkable innovations like high-quality zero-shot text-to-speech. However …
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
What does it take to create the Babel Fish, a tool that can help individuals translate speech
between any two languages? While recent breakthroughs in text-based models have …
Character-LLM: A trainable agent for role-playing
Large language models (LLMs) can be used to serve as agents to simulate human
behaviors, given the powerful ability to understand human instructions and provide high …
On decoder-only architecture for speech-to-text and large language model integration
Large language models (LLMs) have achieved remarkable success in the field of natural
language processing, enabling better human-computer interaction using natural language …