A survey on data selection for language models

A Albalak, Y Elazar, SM Xie, S Longpre… - arXiv preprint arXiv …, 2024 - arxiv.org
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …

Dolma: An open corpus of three trillion tokens for language model pretraining research

L Soldaini, R Kinney, A Bhagia, D Schwenk… - arXiv preprint arXiv …, 2024 - arxiv.org
Information about pretraining corpora used to train the current best-performing language
models is seldom discussed: commercial models rarely detail their data, and even open …

Web crawling

C Olston, M Najork - Foundations and Trends® in Information …, 2010 - nowpublishers.com
This is a survey of the science and practice of web crawling. While at first glance web
crawling may appear to be merely an application of breadth-first-search, the truth is that …

CopyCat: Near-Duplicates within and between the ClueWeb and the Common Crawl

M Fröbe, J Bevendorff, L Gienapp, M Völske… - Proceedings of the 44th …, 2021 - dl.acm.org
The amount of near-duplicates in web crawls like the ClueWeb or Common Crawl demands
from their users either to develop a preprocessing pipeline for deduplication, which is costly …

Learning URL patterns for webpage de-duplication

HS Koppula, KP Leela, A Agarwal… - Proceedings of the third …, 2010 - dl.acm.org
Presence of duplicate documents in the World Wide Web adversely affects crawling,
indexing and relevance, which are the core building blocks of web search. In this paper, we …

A pattern tree-based approach to learning URL normalization rules

T Lei, R Cai, JM Yang, Y Ke, X Fan… - Proceedings of the 19th …, 2010 - dl.acm.org
Duplicate URLs have brought serious troubles to the whole pipeline of a search engine,
from crawling, indexing, to result serving. URL normalization is to transform duplicate URLs …

Fully Open Source Moxin-7B Technical Report

P Zhao, X Shen, Z Kong, Y Shen, SE Chang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, Large Language Models (LLMs) have undergone a significant transformation,
marked by a rapid rise in both their popularity and capabilities. Leading this evolution are …

CLUE: Clustering for mining web URLs

A Morichetta, E Bocchi, H Metwalley… - 2016 28th International …, 2016 - ieeexplore.ieee.org
The Internet has witnessed the proliferation of applications and services that rely on HTTP
as application protocol. Users play games, read emails, watch videos, chat and access web …

DSDD: Domain-Specific Dataset Discovery on the Web

H Zhang, A Santos, J Freire - Proceedings of the 30th ACM International …, 2021 - dl.acm.org
With the push for transparency and open data, many datasets and data repositories are
becoming available on the Web. This opens new opportunities for data-driven exploration …

The missing links: Discovering hidden same-as links among a billion of triples

G Papadakis, G Demartini, P Fankhauser… - Proceedings of the 12th …, 2010 - dl.acm.org
The Semantic Web is constantly gaining momentum, as more and more Web sites and
content providers adopt its principles. At the core of these principles lies the Linked Data …