Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

A comprehensive review on generative AI for education

U Mittal, S Sai, V Chamola - IEEE Access, 2024 - ieeexplore.ieee.org
Artificial Intelligence (AI) has immense potential for personalized learning experiences,
content generation, and vivid educational support. This paper delves into generative AI (GAI) …

Simple and controllable music generation

J Copet, F Kreuk, I Gat, T Remez… - Advances in …, 2024 - proceedings.neurips.cc
We tackle the task of conditional music generation. We introduce MusicGen, a single
Language Model (LM) that operates over several streams of compressed discrete music …

Shap-E: Generating conditional 3D implicit functions

H Jun, A Nichol - arXiv preprint arXiv:2305.02463, 2023 - arxiv.org
We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D
generative models which produce a single output representation, Shap-E directly generates …

Towards generalist biomedical AI

T Tu, S Azizi, D Driess, M Schaekermann, M Amin… - NEJM AI, 2024 - ai.nejm.org
Background: Medicine is inherently multimodal, requiring the simultaneous interpretation
and integration of insights between many data modalities spanning text, imaging, genomics …

Photorealistic video generation with diffusion models

A Gupta, L Yu, K Sohn, X Gu, M Hahn, FF Li… - … on Computer Vision, 2024 - Springer
We present WALT, a diffusion transformer for photorealistic video generation from text
prompts. Our approach has two key design decisions. First, we use a causal encoder to …

VideoPoet: A large language model for zero-shot video generation

D Kondratyuk, L Yu, X Gu, J Lezama, J Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
We present VideoPoet, a language model capable of synthesizing high-quality video, with
matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder …

High-fidelity audio compression with improved RVQGAN

R Kumar, P Seetharaman, A Luebs… - Advances in Neural …, 2024 - proceedings.neurips.cc
Language models have been successfully used to model natural signals, such as
images, speech, and music. A key component of these models is a high quality neural …

On the robustness of ChatGPT: An adversarial and out-of-distribution perspective

J Wang, X Hu, W Hou, H Chen, R Zheng… - arXiv preprint arXiv …, 2023 - arxiv.org
ChatGPT is a recent chatbot service released by OpenAI and is receiving increasing
attention over the past few months. While evaluations of various aspects of ChatGPT have …

Generative AI

S Feuerriegel, J Hartmann, C Janiesch… - Business & Information …, 2024 - Springer
Tom Freston is credited with saying "Innovation is taking two things that exist and putting
them together in a new way". For a long time in history, it has been the prevailing …