A comprehensive review of multimodal large language models: Performance and challenges across different tasks
In an era defined by the explosive growth of data and rapid technological advancements,
Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence …
AudioLM: a language modeling approach to audio generation
We introduce AudioLM, a framework for high-quality audio generation with long-term
consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts …
AudioGPT: Understanding and generating speech, music, sound, and talking head
Large language models (LLMs) have exhibited remarkable capabilities across a variety of
domains and tasks, challenging our understanding of learning and cognition. Despite the …
Make-A-Voice: Unified voice synthesis with discrete representation
Various applications of voice synthesis have been developed independently despite the fact
that they generate "voice" as output in common. In addition, the majority of voice synthesis …
WavChat: A survey of spoken dialogue models
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …
Are discrete units necessary for spoken language modeling?
Recent work in spoken language modeling shows the possibility of learning a language
unsupervisedly from raw audio without any text labels. The approach relies first on …
SpeechPrompt: Prompting speech language models for speech processing tasks
Prompting has become a practical method for utilizing pre-trained language models (LMs).
This approach offers several advantages. It allows an LM to adapt to new tasks with minimal …
Disentangling prosody representations with unsupervised speech reconstruction
Human speech can be characterized by different components, including semantic content,
speaker identity and prosodic information. Significant progress has been made in …
Paralinguistic privacy protection at the edge
Voice user interfaces and digital assistants are rapidly entering our lives and becoming
singular touch points spanning our devices. These always-on services capture and transmit …
Evolutionary Retrofitting
AfterLearnER (After Learning Evolutionary Retrofitting) consists in applying non-
differentiable optimization, including evolutionary methods, to refine fully-trained machine …