Open problems in technical AI governance

A Reuel, B Bucknall, S Casper, T Fist, L Soder… - arXiv preprint arXiv …, 2024 - arxiv.org
AI progress is creating a growing range of risks and opportunities, but it is often unclear how
they should be navigated. In many cases, the barriers and uncertainties faced are at least …

Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models

J Parmar, S Satheesh, M Patwary, M Shoeybi… - arXiv preprint arXiv …, 2024 - arxiv.org
As language models have scaled both their number of parameters and pretraining dataset
sizes, the computational cost for pretraining has become intractable except for the most well …

The Zamba2 Suite: Technical Report

P Glorioso, Q Anthony, Y Tokpanov, A Golubeva… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present the Zamba2 series--a suite of 1.2B, 2.7B, and 7.4B
parameter hybrid Mamba2-transformer models that achieve state-of-the-art performance …

Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining

S Feng, S Prabhumoye, K Kong, D Su… - arXiv preprint arXiv …, 2024 - arxiv.org
Pretraining large language models effectively requires strategic data selection, blending and
ordering. However, key details about data mixtures especially their scalability to longer …

Zyda-2: a 5 trillion token high-quality dataset

Y Tokpanov, P Glorioso, Q Anthony… - arXiv preprint arXiv …, 2024 - arxiv.org
In this technical report, we present Zyda-2: a five trillion token dataset for language model
pretraining. Zyda-2 was used to train our Zamba2 series of models which are state-of-the-art …

Aioli: A unified optimization framework for language model data mixing

MF Chen, MY Hu, N Lourie, K Cho, C Ré - arXiv preprint arXiv:2411.05735, 2024 - arxiv.org
Language model performance depends on identifying the optimal mixture of data groups to
train on (e.g., law, code, math). Prior work has proposed a diverse set of methods to efficiently …

Fanar: An Arabic-Centric Multimodal Generative AI Platform

F Team, U Abbas, MS Ahmad, F Alam… - arXiv preprint arXiv …, 2025 - arxiv.org
We present Fanar, a platform for Arabic-centric multimodal generative AI systems that
supports language, speech and image generation tasks. At the heart of Fanar are Fanar Star …