A high-performance neuroprosthesis for speech decoding and avatar control
Speech neuroprostheses have the potential to restore communication to people living with
paralysis, but naturalistic speed and expressivity are elusive. Here we use high-density …
High fidelity neural audio compression
We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural
networks. It consists in a streaming encoder-decoder architecture with quantized latent …
AudioGen: Textually guided audio generation
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive generative model that generates …
Foundation models for music: A survey
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …
Scaling laws for generative mixed-modal language models
Generative language models define distributions over sequences of tokens that can
represent essentially any combination of data modalities (e.g., any permutation of image …
Textually pretrained speech language models
Speech language models (SpeechLMs) process and generate acoustic data only, without
textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using …
SpiRit-LM: Interleaved Spoken and Written Language Model
We introduce SpiRit-LM, a foundation multimodal language model that freely mixes text and
speech. Our model is based on a 7B pretrained text language model that we extend to the …
SeamlessM4T-Massively Multilingual & Multimodal Machine Translation
What does it take to create the Babel Fish, a tool that can help individuals translate speech
between any two languages? While recent breakthroughs in text-based models have …
Seamless: Multilingual Expressive and Streaming Speech Translation
Large-scale automatic speech translation systems today lack key features that help machine-
mediated communication feel seamless when compared to human-to-human dialogue. In …
CVSS corpus and massively multilingual speech-to-speech translation
We introduce CVSS, a massively multilingual-to-English speech-to-speech translation
(S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English …