Knowledge editing for large language models: A survey
Large Language Models (LLMs) have recently transformed both the academic and industrial
landscapes due to their remarkable capacity to understand, analyze, and generate texts …
landscapes due to their remarkable capacity to understand, analyze, and generate texts …
Foundation Models Defining a New Era in Vision: a Survey and Outlook
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …
fundamental to understanding our world. The complex relations between objects and their …
Imagebind: One embedding space to bind them all
We present ImageBind, an approach to learn a joint embedding across six different
modalities-images, text, audio, depth, thermal, and IMU data. We show that all combinations …
modalities-images, text, audio, depth, thermal, and IMU data. We show that all combinations …
Audioldm: Text-to-audio generation with latent diffusion models
Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general
audio based on text descriptions. However, previous studies in TTA have limited generation …
audio based on text descriptions. However, previous studies in TTA have limited generation …
Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models
Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …
text-to-video generation. Its application to audio still lags behind for two main reasons: the …
High fidelity neural audio compression
We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural
networks. It consists in a streaming encoder-decoder architecture with quantized latent …
networks. It consists in a streaming encoder-decoder architecture with quantized latent …
High-fidelity audio compression with improved rvqgan
Abstract Language models have been successfully used to model natural signals, such as
images, speech, and music. A key component of these models is a high quality neural …
images, speech, and music. A key component of these models is a high quality neural …
Any-to-any generation via composable diffusion
Abstract We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image, video, or audio …
generating any combination of output modalities, such as language, image, video, or audio …
Data2vec: A general framework for self-supervised learning in speech, vision and language
While the general idea of self-supervised learning is identical across modalities, the actual
algorithms and objectives differ widely because they were developed with a single modality …
algorithms and objectives differ widely because they were developed with a single modality …
Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation
Contrastive learning has shown remarkable success in the field of multimodal
representation learning. In this paper, we propose a pipeline of contrastive language-audio …
representation learning. In this paper, we propose a pipeline of contrastive language-audio …