Knowledge editing for large language models: A survey

S Wang, Y Zhu, H Liu, Z Zheng, C Chen, J Li - ACM Computing Surveys, 2024 - dl.acm.org
Large Language Models (LLMs) have recently transformed both the academic and industrial
landscapes due to their remarkable capacity to understand, analyze, and generate texts …

Foundation Models Defining a New Era in Vision: A Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

ImageBind: One embedding space to bind them all

R Girdhar, A El-Nouby, Z Liu, M Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …
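
This entry describes learning one joint embedding space that binds several modalities through image-paired data. As a rough illustration only, the sketch below shows an image-anchored contrastive (InfoNCE) objective of the kind such binding relies on; the toy encoders, dimensions, and random tensors are assumptions made for the example, not the authors' code.

# Minimal sketch of an image-anchored contrastive binding loss (illustrative,
# not the ImageBind implementation). Toy encoders and random data stand in for
# real image/audio features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Maps one modality (given as a flat feature vector) into the shared space."""
    def __init__(self, in_dim, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def info_nce(z_img, z_other, temperature=0.07):
    """Symmetric InfoNCE over a batch of image/other pairs (positives on the diagonal)."""
    logits = z_img @ z_other.t() / temperature
    targets = torch.arange(z_img.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

image_enc, audio_enc = ToyEncoder(in_dim=512), ToyEncoder(in_dim=256)
images, audio = torch.randn(32, 512), torch.randn(32, 256)  # a batch of image-paired audio
loss = info_nce(image_enc(images), audio_enc(audio))
loss.backward()

The point of binding through images is that modalities which never co-occur in training still become comparable, because each one is aligned to the same image anchor in the shared space.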

High-fidelity audio compression with improved RVQGAN

R Kumar, P Seetharaman, A Luebs… - Advances in Neural …, 2023 - proceedings.neurips.cc
Language models have been successfully used to model natural signals, such as
images, speech, and music. A key component of these models is a high-quality neural …

AudioLDM: Text-to-audio generation with latent diffusion models

H Liu, Z Chen, Y Yuan, X Mei, X Liu, D Mandic… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general
audio based on text descriptions. However, previous studies in TTA have limited generation …

Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models

R Huang, J Huang, D Yang, Y Ren… - International …, 2023 - proceedings.mlr.press
Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …

Any-to-any generation via composable diffusion

Z Tang, Z Yang, C Zhu, M Zeng… - Advances in Neural …, 2023 - proceedings.neurips.cc
We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image, video, or audio …

High fidelity neural audio compression

A Défossez, J Copet, G Synnaeve, Y Adi - arXiv preprint arXiv:2210.13438, 2022 - arxiv.org
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural
networks. It consists of a streaming encoder-decoder architecture with quantized latent …
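
This snippet points at the codec's key mechanism: an encoder-decoder whose latent frames are discretized. Residual vector quantization (RVQ) is the usual way neural codecs of this kind, including the two compression papers listed here, build that quantized latent space. The following is a hedged sketch of RVQ with random codebooks and made-up sizes, meant only to show how each stage quantizes the residual left by the previous one, not the paper's configuration.

# Hedged sketch of residual vector quantization (RVQ) over per-frame latents.
# Codebooks, dimensions, and frame counts are illustrative assumptions.
import torch

def residual_vq(latents, codebooks):
    """Quantize each latent vector as a sum of nearest codewords, one per stage."""
    residual = latents
    quantized = torch.zeros_like(latents)
    indices = []
    for codebook in codebooks:                   # each stage refines the remaining residual
        dists = torch.cdist(residual, codebook)  # (n_frames, codebook_size) distances
        idx = dists.argmin(dim=-1)
        chosen = codebook[idx]
        quantized = quantized + chosen
        residual = residual - chosen
        indices.append(idx)                      # the integer codes are what gets stored/sent
    return quantized, torch.stack(indices, dim=-1)

latent_dim, n_stages, codebook_size = 64, 4, 256
codebooks = [torch.randn(codebook_size, latent_dim) for _ in range(n_stages)]
frames = torch.randn(100, latent_dim)            # e.g. one latent vector per audio frame
recon, codes = residual_vq(frames, codebooks)
print(codes.shape)                               # (100, 4): 4 code indices per frame

Only the per-stage code indices need to be transmitted; the decoder sums the corresponding codewords back into an approximate latent, and using more stages trades bitrate for reconstruction quality.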

VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset

S Chen, H Li, Q Wang, Z Zhao… - Advances in Neural …, 2023 - proceedings.neurips.cc
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …

Perception Test: A diagnostic benchmark for multimodal video models

V Patraucean, L Smaira, A Gupta… - Advances in …, 2023 - proceedings.neurips.cc
We propose a novel multimodal video benchmark, the Perception Test, to evaluate the
perception and reasoning skills of pre-trained multimodal models (e.g., Flamingo, BEiT-3, or …