DOM based content extraction via text density

F Sun, D Song, L Liao - Proceedings of the 34th international ACM SIGIR …, 2011 - dl.acm.org
In addition to the main content, most web pages also contain navigation panels,
advertisements and copyright and disclaimer notices. This additional content, which is also …

CETR: content extraction via tag ratios

T Weninger, WH Hsu, J Han - … of the 19th international conference on …, 2010 - dl.acm.org
We present Content Extraction via Tag Ratios (CETR), a method to extract content text from
diverse webpages by using the HTML document's tag ratios. We describe how to compute …

Graphical user interface based sensitive information and internal information vulnerability management system

S Huang, F Huang, L Ren, A Dong - US Patent 8,140,664, 2012 - Google Patents
2. Description of the Related Art As computers and networks become more proliferated,
powerful, and affordable, a growing number of enterprises are using both to perform critical …

Extracting article text from the web with maximum subsequence segmentation

J Pasternack, D Roth - Proceedings of the 18th international conference …, 2009 - dl.acm.org
Much of the information on the Web is found in articles from online news outlets, magazines,
encyclopedias, review collections, and other sources. However, extracting this content from …

Tracking web spam with HTML style similarities

T Urvoy, E Chauveau, P Filoche… - ACM Transactions on the …, 2008 - dl.acm.org
Automatically generated content is ubiquitous in the web: dynamic sites built using the
three-tier paradigm are good examples (e.g., commercial sites, blogs and other sites edited using …

Automatic web content extraction by combination of learning and grouping

S Wu, J Liu, J Fan - Proceedings of the 24th international conference on …, 2015 - dl.acm.org
Web pages consist of not only actual content, but also other elements such as branding
banners, navigational elements, advertisements, copyright etc. This noisy content is typically …

A hybrid approach for content extraction with text density and visual importance of DOM nodes

D Song, F Sun, L Liao - Knowledge and Information Systems, 2015 - Springer
Additional contents in web pages, such as navigation panels, advertisements, copyrights
and disclaimer notices, are typically not related to the main subject and may hamper the …

A comprehensive survey on web content extraction algorithms and techniques

SM Al-Ghuribi, S Alshomrani - 2013 International Conference …, 2013 - ieeexplore.ieee.org
Web Content Extraction is an important problem that has been studied through different
approaches and algorithms. It is interested in extracting meaningful and useful data from the …

Victor: the web-page cleaning tool

M Spousta, M Marek, P Pecina - 4th Web as Corpus Workshop …, 2008 - academia.edu
In this paper we present a complete solution for automatic cleaning of arbitrary HTML pages
with a goal of using web data as a corpus in the area of natural language processing and …

Implementation and evaluation of a quality-based search engine

T Mandl - Proceedings of the seventeenth conference on …, 2006 - dl.acm.org
In this paper, an approach for the implementation of a quality-based Web search engine is
proposed. Quality retrieval is introduced and an overview on previous efforts to implement …