Multimodal biomedical AI

JN Acosta, GJ Falcone, P Rajpurkar, EJ Topol - Nature Medicine, 2022 - nature.com
The increasing availability of biomedical data from large biobanks, electronic health records,
medical imaging, wearable and ambient biosensors, and the lower cost of genome and …

A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Dinov2: Learning robust visual features without supervision

M Oquab, T Darcet, T Moutakanni, H Vo… - arxiv preprint arxiv …, 2023 - arxiv.org
The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …

A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications

L Alzubaidi, J Bai, A Al-Sabaawi, J Santamaría… - Journal of Big Data, 2023 - Springer
Data scarcity is a major challenge when training deep learning (DL) models. DL demands a
large amount of data to achieve exceptional performance. Unfortunately, many applications …

Eva: Exploring the limits of masked visual representation learning at scale

Y Fang, W Wang, B **e, Q Sun, L Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …

Self-supervised learning from images with a joint-embedding predictive architecture

M Assran, Q Duval, I Misra… - Proceedings of the …, 2023 - openaccess.thecvf.com
This paper demonstrates an approach for learning highly semantic image representations
without relying on hand-crafted data-augmentations. We introduce the Image-based Joint …

Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models

R Huang, J Huang, D Yang, Y Ren… - International …, 2023 - proceedings.mlr.press
Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …

Where are we in the search for an artificial visual cortex for embodied intelligence?

A Majumdar, K Yadav, S Arnaud, J Ma… - Advances in …, 2023 - proceedings.neurips.cc
We present the largest and most comprehensive empirical study of pre-trained visual
representations (PVRs) or visual 'foundation models' for Embodied AI. First, we curate …

Masked autoencoders for point cloud self-supervised learning

Y Pang, W Wang, FEH Tay, W Liu, Y Tian… - European conference on …, 2022 - Springer
As a promising scheme of self-supervised learning, masked autoencoding has significantly
advanced natural language processing and computer vision. Inspired by this, we propose a …