A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT

C Zhou, Q Li, C Li, J Yu, Y Liu, G Wang… - International Journal of …, 2024 - Springer
Pretrained Foundation Models (PFMs) are regarded as the foundation for various
downstream tasks across different data modalities. A PFM (e.g., BERT, ChatGPT, GPT-4) is …

Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

ImageBind: One embedding space to bind them all

R Girdhar, A El-Nouby, Z Liu, M Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on …

StableRep: Synthetic images from text-to-image models make strong visual representation learners

Y Tian, L Fan, P Isola, H Chang… - Advances in Neural …, 2024 - proceedings.neurips.cc
We investigate the potential of learning visual representations using synthetic images
generated by text-to-image models. This is a natural question in the light of the excellent …

Fake it till you make it: Learning transferable representations from synthetic ImageNet clones

MB Sarıyıldız, K Alahari, D Larlus… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recent image generation models such as Stable Diffusion have exhibited an impressive
ability to generate fairly realistic images starting from a simple text prompt. Could such …

Masked siamese networks for label-efficient learning

M Assran, M Caron, I Misra, P Bojanowski… - … on Computer Vision, 2022 - Springer
We propose Masked Siamese Networks (MSN), a self-supervised learning
framework for learning image representations. Our approach matches the representation of …

Versatile diffusion: Text, images and variations all in one diffusion model

X Xu, Z Wang, G Zhang, K Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recent advances in diffusion models have set an impressive milestone in many generation
tasks, and trending works such as DALL-E2, Imagen, and Stable Diffusion have attracted …

ST-Adapter: Parameter-efficient image-to-video transfer learning

J Pan, Z Lin, X Zhu, J Shao, H Li - Advances in Neural …, 2022 - proceedings.neurips.cc
Capitalizing on large pre-trained models for various downstream tasks of interest has
recently emerged with promising performance. Due to the ever-growing model size, the …

Aligning bag of regions for open-vocabulary object detection

S Wu, W Zhang, S Jin, W Liu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Pre-trained vision-language models (VLMs) learn to align vision and language
representations on large-scale datasets, where each image-text pair usually contains a bag …