Task-adaptive pretrained language models via clustered-importance sampling
Specialist language models (LMs) focus on a specific task or domain on which they often
outperform generalist LMs of the same size. However, the specialist data needed to pretrain …
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
The pretraining data of today's strongest language models is opaque; in particular, little is
known about the proportions of various domains or languages represented. In this work, we …
Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World
The increasing presence of AI-generated content on the internet raises a critical question:
What happens when generative machine learning models are pretrained on web-scale …
Data Mixture Inference Attack: BPE Tokenizers Reveal Training Data Compositions
The pretraining data of today's strongest language models remains opaque, even when their
parameters are open-sourced. In particular, little is known about the proportions of different …
Optimizing Pretraining Data Mixtures with LLM-Estimated Utility
Large Language Models improve with increasing amounts of high-quality training data.
However, leveraging larger datasets requires balancing quality, quantity, and diversity …
Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models
Synthetic data generation with Large Language Models is a promising paradigm for
augmenting natural data over a nearly infinite range of tasks. Given this variety, direct …
Accumulating Data Avoids Model Collapse
The increasing prevalence of AI-generated content on the internet raises a critical and timely
question: What happens when generative machine learning models are pretrained on web …