Detecting near-duplicates for web crawling

GS Manku, A Jain, A Das Sarma - … of the 16th international conference on …, 2007 - dl.acm.org
Near-duplicate web documents are abundant. Two such documents differ from each other in
a very small portion that displays advertisements, for example. Such differences are …

Web crawling

C Olston, M Najork - Foundations and Trends® in Information …, 2010 - nowpublishers.com
This is a survey of the science and practice of web crawling. While at first glance web
crawling may appear to be merely an application of breadth-first-search, the truth is that …

System and method for URL fetching retry mechanism

D Shribman, O Vilenski - US Patent 10,963,531, 2021 - Google Patents

LSH forest: self-tuning indexes for similarity search

M Bawa, T Condie, P Ganesan - … of the 14th international conference on …, 2005 - dl.acm.org
We consider the problem of indexing high-dimensional data for answering (approximate)
similarity-search queries. Similarity indexes prove to be important in a wide variety of …

A large-scale study of the evolution of web pages

D Fetterly, M Manasse, M Najork, J Wiener - Proceedings of the 12th …, 2003 - dl.acm.org
How fast does the web change? Does most of the content remain unchanged once it has
been authored, or are the documents continuously updated? Do pages change a little or a …

Efficient exact set-similarity joins

A Arasu, V Ganti, R Kaushik - … of the 32nd international conference on Very …, 2006 - vldb.org
Given two input collections of sets, a set-similarity join (SSJoin) identifies all pairs of sets,
one from each collection, that have high similarity. Recent work has identified SSJoin as a …

Searching the web

A Arasu, J Cho, H Garcia-Molina, A Paepcke… - ACM Transactions on …, 2001 - dl.acm.org
We offer an overview of current Web search engine design. After introducing a generic
search engine architecture, we examine each engine component in turn. We cover crawling …

Automatic identification of user goals in web search

U Lee, Z Liu, J Cho - Proceedings of the 14th international conference …, 2005 - dl.acm.org
There has been recent interest in studying the "goal" behind a user's Web query, so that
this goal can be used to improve the quality of a search engine's results. Previous studies …

Application-specific Delta-encoding via Resemblance Detection.

F Douglis, A Iyengar - USENIX annual technical conference, general …, 2003 - usenix.org
Many objects, such as files, electronic messages, and web pages, contain overlapping
content. Numerous past research projects have observed that one can compress one object …

System and method for improving content fetching by selecting tunnel devices

D Shribman, O Vilenski - US Patent 11,190,374, 2021 - Google Patents
H04L69/168—Implementation or adaptation of Internet protocol [IP], of transmission control
protocol [TCP] or of user datagram protocol [UDP] specially adapted for link layer protocols …