Multimodal biomedical AI
The increasing availability of biomedical data from large biobanks, electronic health records,
medical imaging, wearable and ambient biosensors, and the lower cost of genome and …
A review of deep learning techniques for speech processing
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …
DINOv2: Learning robust visual features without supervision
The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …
A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications
Data scarcity is a major challenge when training deep learning (DL) models. DL demands a
large amount of data to achieve exceptional performance. Unfortunately, many applications …
EVA: Exploring the limits of masked visual representation learning at scale
We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …
Self-supervised learning from images with a joint-embedding predictive architecture
This paper demonstrates an approach for learning highly semantic image representations
without relying on hand-crafted data-augmentations. We introduce the Image-based Joint …
Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models
Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …
Multimodal foundation models: From specialists to general-purpose assistants
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …
Where are we in the search for an artificial visual cortex for embodied intelligence?
We present the largest and most comprehensive empirical study of pre-trained visual
representations (PVRs) or visual 'foundation models' for Embodied AI. First, we curate …
Masked autoencoders for point cloud self-supervised learning
As a promising scheme of self-supervised learning, masked autoencoding has significantly
advanced natural language processing and computer vision. Inspired by this, we propose a …