Trafilatura: A web scraping library and command-line tool for text discovery and extraction
A Barbaresi - Proceedings of the 59th Annual Meeting of the …, 2021 - aclanthology.org
An essential operation in web corpus construction consists in retaining the desired content
while discarding the rest. Another challenge is finding one's way through websites. This article …
Rapid development of a data visualization service in an emergency response
We present the design and development of a data visualization service (RAMPVIS) in
response to the urgent need to support epidemiological modeling workflows during the …
An Empirical Comparison of Web Content Extraction Algorithms
Main content extraction from web pages, sometimes also called boilerplate removal, has
been a research topic for over two decades. Yet despite web pages being delivered in a …
Extracting the main content of web pages using the First Impression Area
Extracting the main content from a web page is essential in various applications such as
web crawlers and browser reader modes. Existing extraction methods using text-based …
JSAnalyzer: a Web developer tool for simplifying mobile Web pages through non-critical JavaScript elimination
The amount of JavaScript used in web pages has substantially grown in the past decade,
leading to large and complex pages that are computationally intensive for handheld mobile …
A regular expression generator based on CSS selectors for efficient extraction from HTML pages
E Uzun - Turkish Journal of Electrical Engineering and …, 2020 - journals.tubitak.gov.tr
Cascading style sheets (CSS) selectors are patterns used to select HTML elements. They
are often preferred in web data extraction because they are easy to prepare and have short …