Knowledge editing for large language models: A survey
Large Language Models (LLMs) have recently transformed both the academic and industrial
landscapes due to their remarkable capacity to understand, analyze, and generate texts …
Foundation models defining a new era in vision: A survey and outlook
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …
ImageBind: One embedding space to bind them all
We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …
High-fidelity audio compression with improved RVQGAN
Language models have been successfully used to model natural signals, such as
images, speech, and music. A key component of these models is a high quality neural …
AudioLDM: Text-to-audio generation with latent diffusion models
Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general
audio based on text descriptions. However, previous studies in TTA have limited generation …
Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models
Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …
Any-to-any generation via composable diffusion
We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image, video, or audio …
High fidelity neural audio compression
We introduce a state-of-the-art, real-time, high-fidelity audio codec leveraging neural
networks. It consists of a streaming encoder-decoder architecture with quantized latent …
VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …
Perception test: A diagnostic benchmark for multimodal video models
We propose a novel multimodal video benchmark-the Perception Test-to evaluate the
perception and reasoning skills of pre-trained multimodal models (e.g., Flamingo, BEiT-3, or …