A survey on data selection for language models
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …
Dolma: An open corpus of three trillion tokens for language model pretraining research
Information about pretraining corpora used to train the current best-performing language
models is seldom discussed: commercial models rarely detail their data, and even open …
Web crawling
This is a survey of the science and practice of web crawling. While at first glance web
crawling may appear to be merely an application of breadth-first-search, the truth is that …
CopyCat: Near-Duplicates within and between the ClueWeb and the Common Crawl
The amount of near-duplicates in web crawls like the ClueWeb or Common Crawl demands
from their users either to develop a preprocessing pipeline for deduplication, which is costly …
Learning URL patterns for webpage de-duplication
Presence of duplicate documents in the World Wide Web adversely affects crawling,
indexing and relevance, which are the core building blocks of web search. In this paper, we …
A pattern tree-based approach to learning URL normalization rules
Duplicate URLs have brought serious troubles to the whole pipeline of a search engine,
from crawling, indexing, to result serving. URL normalization is to transform duplicate URLs …
Fully Open Source Moxin-7B Technical Report
Recently, Large Language Models (LLMs) have undergone a significant transformation,
marked by a rapid rise in both their popularity and capabilities. Leading this evolution are …
CLUE: Clustering for mining web URLs
The Internet has witnessed the proliferation of applications and services that rely on HTTP
as application protocol. Users play games, read emails, watch videos, chat and access web …
DSDD: Domain-Specific Dataset Discovery on the Web
With the push for transparency and open data, many datasets and data repositories are
becoming available on the Web. This opens new opportunities for data-driven exploration …
The missing links: Discovering hidden same-as links among a billion of triples
The Semantic Web is constantly gaining momentum, as more and more Web sites and
content providers adopt its principles. At the core of these principles lies the Linked Data …