Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

G Team, P Georgiev, VI Lei, R Burnell, L Bai… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce the Gemini 1.5 family of models, representing the next generation
of highly compute-efficient multimodal models capable of recalling and reasoning over fine …

Scaling vision transformers to 22 billion parameters

M Dehghani, J Djolonga, B Mustafa… - International …, 2023 - proceedings.mlr.press
The scaling of Transformers has driven breakthrough capabilities for language models. At
present, the largest large language models (LLMs) contain upwards of 100B parameters …

A general theoretical paradigm to understand learning from human preferences

MG Azar, ZD Guo, B Piot, R Munos… - International …, 2024 - proceedings.mlr.press
The prevalent deployment of learning from human preferences through reinforcement
learning from human feedback (RLHF) relies on two important approximations: the first assumes that pairwise …

PaLI: A jointly-scaled multilingual language-image model

X Chen, X Wang, S Changpinyo… - arXiv preprint arXiv …, 2022 - arxiv.org
Effective scaling and a flexible task interface enable large language models to excel at many
tasks. We present PaLI (Pathways Language and Image model), a model that extends this …

Offline reinforcement learning with implicit Q-learning

I Kostrikov, A Nair, S Levine - arXiv preprint arXiv:2110.06169, 2021 - arxiv.org
Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that
improves over the behavior policy that collected the dataset, while at the same time …

Training diffusion models with reinforcement learning

K Black, M Janner, Y Du, I Kostrikov… - arXiv preprint arXiv …, 2023 - arxiv.org
Diffusion models are a class of flexible generative models trained with an approximation to
the log-likelihood objective. However, most use cases of diffusion models are not concerned …

Scaling open-vocabulary object detection

M Minderer, A Gritsenko… - Advances in Neural …, 2024 - proceedings.neurips.cc
Open-vocabulary object detection has benefited greatly from pretrained vision-language
models, but is still limited by the amount of available detection training data. While detection …

Structured denoising diffusion models in discrete state-spaces

J Austin, DD Johnson, J Ho, D Tarlow… - Advances in …, 2021 - proceedings.neurips.cc
Denoising diffusion probabilistic models (DDPMs) [Ho et al. 2020] have shown impressive
results on image and waveform generation in continuous state spaces. Here, we introduce …

On efficient training of large-scale deep learning models: A literature review

L Shen, Y Sun, Z Yu, L Ding, X Tian, D Tao - arXiv preprint arXiv …, 2023 - arxiv.org
The field of deep learning has witnessed significant progress, particularly in computer vision
(CV), natural language processing (NLP), and speech. The use of large-scale models …

Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution

M Dehghani, B Mustafa, J Djolonga… - Advances in …, 2024 - proceedings.neurips.cc
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution
before processing them with computer vision models has not yet been successfully …