Open problems in technical AI governance
AI progress is creating a growing range of risks and opportunities, but it is often unclear how
they should be navigated. In many cases, the barriers and uncertainties faced are at least …
Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models
As language models have scaled both their number of parameters and pretraining dataset
sizes, the computational cost for pretraining has become intractable except for the most well …
The Zamba2 Suite: Technical Report
In this technical report, we present the Zamba2 series--a suite of 1.2B, 2.7B, and 7.4B
parameter hybrid Mamba2-transformer models that achieve state-of-the-art performance …
Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining
Pretraining large language models effectively requires strategic data selection, blending and
ordering. However, key details about data mixtures, especially their scalability to longer …
Zyda-2: a 5 trillion token high-quality dataset
In this technical report, we present Zyda-2: a five trillion token dataset for language model
pretraining. Zyda-2 was used to train our Zamba2 series of models, which are state-of-the-art …
Aioli: A unified optimization framework for language model data mixing
Language model performance depends on identifying the optimal mixture of data groups to
train on (e.g., law, code, math). Prior work has proposed a diverse set of methods to efficiently …
Fanar: An Arabic-Centric Multimodal Generative AI Platform
We present Fanar, a platform for Arabic-centric multimodal generative AI systems that
supports language, speech and image generation tasks. At the heart of Fanar are Fanar Star …