- Academic Search

A survey on data selection for language models

A Albalak, Y Elazar, SM Xie, S Longpre… - arXiv preprint arXiv …, 2024 - arxiv.org

A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …

Cited by 86 · All 3 versions

[PDF] arxiv.org

Dolma: An open corpus of three trillion tokens for language model pretraining research

L Soldaini, R Kinney, A Bhagia, D Schwenk… - arXiv preprint arXiv …, 2024 - arxiv.org

Information about pretraining corpora used to train the current best-performing language
models is seldom discussed: commercial models rarely detail their data, and even open …

Cited by 129 · All 5 versions

[PDF] nowpublishers.com

Web crawling

C Olston, M Najork - Foundations and Trends® in Information …, 2010 - nowpublishers.com

This is a survey of the science and practice of web crawling. While at first glance web
crawling may appear to be merely an application of breadth-first-search, the truth is that …

Cited by 594 · All 24 versions

[PDF] webis.de

CopyCat: Near-Duplicates within and between the ClueWeb and the Common Crawl

M Fröbe, J Bevendorff, L Gienapp, M Völske… - Proceedings of the 44th …, 2021 - dl.acm.org

The amount of near-duplicates in web crawls like the ClueWeb or Common Crawl demands
from their users either to develop a preprocessing pipeline for deduplication, which is costly …

Cited by 20 · All 2 versions

[PDF] academia.edu

Learning URL patterns for webpage de-duplication

HS Koppula, KP Leela, A Agarwal… - Proceedings of the third …, 2010 - dl.acm.org

Presence of duplicate documents in the World Wide Web adversely affects crawling,
indexing and relevance, which are the core building blocks of web search. In this paper, we …

Cited by 83 · All 9 versions

[PDF] microsoft.com

A pattern tree-based approach to learning URL normalization rules

T Lei, R Cai, JM Yang, Y Ke, X Fan… - Proceedings of the 19th …, 2010 - dl.acm.org

Duplicate URLs have brought serious troubles to the whole pipeline of a search engine,
from crawling, indexing, to result serving. URL normalization is to transform duplicate URLs …

Cited by 32 · All 8 versions

[PDF] arxiv.org

Fully Open Source Moxin-7B Technical Report

P Zhao, X Shen, Z Kong, Y Shen, SE Chang… - arXiv preprint arXiv …, 2024 - arxiv.org

Recently, Large Language Models (LLMs) have undergone a significant transformation,
marked by a rapid rise in both their popularity and capabilities. Leading this evolution are …

All 2 versions

[PDF] core.ac.uk

CLUE: Clustering for mining web URLs

A Morichetta, E Bocchi, H Metwalley… - 2016 28th International …, 2016 - ieeexplore.ieee.org

The Internet has witnessed the proliferation of applications and services that rely on HTTP
as application protocol. Users play games, read emails, watch videos, chat and access web …

Cited by 18 · All 4 versions

[PDF] acm.org

DSDD: Domain-Specific Dataset Discovery on the Web

H Zhang, A Santos, J Freire - Proceedings of the 30th ACM International …, 2021 - dl.acm.org

With the push for transparency and open data, many datasets and data repositories are
becoming available on the Web. This opens new opportunities for data-driven exploration …

Cited by 6 · All 4 versions

[PDF] academia.edu

The missing links: Discovering hidden same-as links among a billion of triples

G Papadakis, G Demartini, P Fankhauser… - Proceedings of the 12th …, 2010 - dl.acm.org

The Semantic Web is constantly gaining momentum, as more and more Web sites and
content providers adopt its principles. At the core of these principles lies the Linked Data …

Cited by 29 · All 4 versions

URL normalization for de-duplication of web pages

A survey on data selection for language models