Software:Frontera (web crawling)

Revision as of 14:20, 16 May 2022 by imported>Jport (url)
Frontera
Original author(s): Alexander Sibiryakov, Javier Casas
Developer(s): Scrapinghub Ltd., GitHub community
Initial release: November 1, 2014
Stable release: v0.8.1 / April 5, 2019 [1]
Written in: Python
Operating system: OS X, Linux
Type: web crawling
License: BSD 3-clause license
Website: github.com/scrapinghub/frontera

Frontera is an open-source web crawling framework that implements the crawl frontier component and provides scalability primitives for web crawler applications.

Overview

Large-scale web crawlers often operate in batch mode, with sequential phases of injection, fetching, parsing, deduplication, and scheduling. This delays updates to the crawl when the web changes. The batch design is mostly motivated by the relatively low random access performance of hard disks compared to sequential access.

Frontera instead relies on key-value storage systems, using efficient data structures and powerful hardware so that crawling, parsing, and scheduling the indexing of new links can happen concurrently. It is an open-source project designed to fit various use cases, with high flexibility and configurability.

Large-scale web crawls are Frontera's main purpose. It also allows crawls of moderate size on a single machine with a few cores, using the single process and distributed spiders run modes.

Features

Frontera is written mainly in Python. Data transport and formats are well abstracted, and the out-of-the-box implementations support MessagePack, JSON, Kafka and ZeroMQ.

  • Online operation: small request batches, with parsing done right after fetching.
  • Pluggable backend architecture: low-level storage logic is separated from crawling policy.
  • Three run modes: single process, distributed spiders, distributed backend and spiders.
  • Transparent data flow, making it easy to integrate custom components.
  • Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
  • SQLAlchemy and HBase storage backends.
  • Revisiting logic (only with RDBMS backend).
  • Optional use of Scrapy for fetching and parsing.
  • BSD 3-clause license, allowing use in any commercial product.
  • Python 3 support.
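The message bus abstraction listed above can be illustrated with a minimal sketch. It is not Frontera's actual API: the class names `MessageBus` and `InMemoryBus`, the topic names, and the `run_spider_step` helper are all hypothetical, and an in-memory queue stands in for a real transport such as Kafka or ZeroMQ. JSON is used as the message format.

```python
import json
from abc import ABC, abstractmethod
from collections import deque


class MessageBus(ABC):
    """Hypothetical transport interface (an assumption, not Frontera's API)."""

    @abstractmethod
    def send(self, topic, payload): ...

    @abstractmethod
    def receive(self, topic, max_messages): ...


class InMemoryBus(MessageBus):
    """Toy transport: one FIFO of JSON-encoded messages per topic."""

    def __init__(self):
        self._topics = {}

    def send(self, topic, payload):
        # Encode at the transport boundary so callers only handle plain dicts.
        self._topics.setdefault(topic, deque()).append(json.dumps(payload))

    def receive(self, topic, max_messages):
        queue = self._topics.get(topic, deque())
        batch = []
        while queue and len(batch) < max_messages:
            batch.append(json.loads(queue.popleft()))
        return batch


def run_spider_step(bus, fetch):
    """Read a small batch of URLs, fetch each one, and publish the results."""
    for request in bus.receive("urls_to_crawl", max_messages=2):
        url = request["url"]
        bus.send("crawl_results", {"url": url, "body": fetch(url)})


bus = InMemoryBus()
for url in ["http://example.com/a", "http://example.com/b", "http://example.com/c"]:
    bus.send("urls_to_crawl", {"url": url})

# A stub fetcher keeps the sketch free of network access.
run_spider_step(bus, fetch=lambda url: "<html>stub for %s</html>" % url)
results = bus.receive("crawl_results", max_messages=10)
print([r["url"] for r in results])
```

Because the spider step depends only on the two-method interface, replacing `InMemoryBus` with another transport would not require changes to the spider code, which is the point of such an abstraction.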

Comparison to other web crawlers

Although Frontera isn't a web crawler itself, it requires a streaming crawling architecture rather than a batch crawling approach.[citation needed]

StormCrawler is another stream-oriented crawler built on top of Apache Storm whilst using some components from the Apache Nutch ecosystem. Scrapy Cluster was designed by ISTResearch with precise monitoring and management of the queue in mind. These systems provide fetching and/or queueing mechanisms, but no link database or content processing.

Battle testing

At Scrapinghub Ltd. there is a crawler processing 1600 requests per second at peak. It is built primarily on Frontera, with Kafka as the message bus and HBase as storage for link states and the link database. The crawler operates in cycles; each cycle takes 1.5 months and results in 1.7B downloaded pages.[2]

A crawl of the Spanish internet resulted in 46.5M pages in 1.5 months on an AWS cluster with 2 spider machines.[3]

History

The first version of Frontera operated in a single process, as part of a custom scheduler for Scrapy, using an on-disk SQLite database to store link states and the queue. It was able to crawl for days. Once the volume of links became noticeable, it spent more and more time on SELECT queries, making the crawl inefficient. At that time Frontera was being developed under DARPA's Memex program and was included in its catalog of open source projects.[4]

In 2015, subsequent versions of Frontera used HBase to store the link database and queue. The application was split into two parts: backend and fetcher. The backend was responsible for communicating with HBase and exchanged messages through Kafka. The fetcher only read a Kafka topic with URLs to crawl and produced crawl results to another topic consumed by the backend, thus creating a closed cycle. The first priority queue prototype suitable for web-scale crawling was implemented during that time. The queue produced batches with limits on the number of hosts and on requests per host.
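The batching rule described above, with limits on the number of hosts and on requests per host, can be sketched as follows. This is an illustrative in-memory version built on Python's `heapq`, not the HBase-backed queue Frontera used; the function name `next_batch` and its parameters are hypothetical, and a lower number is assumed to mean higher priority.

```python
import heapq
from urllib.parse import urlparse


def next_batch(queue, max_hosts, max_requests_per_host):
    """Pop one batch from a heap of (priority, url) tuples.

    Requests that would exceed either limit are pushed back so that
    they remain available for a later batch.
    """
    per_host = {}          # host -> number of requests already in this batch
    batch, deferred = [], []
    while queue:
        priority, url = heapq.heappop(queue)
        host = urlparse(url).hostname
        count = per_host.get(host, 0)
        too_many_hosts = host not in per_host and len(per_host) >= max_hosts
        if too_many_hosts or count >= max_requests_per_host:
            deferred.append((priority, url))
            continue
        per_host[host] = count + 1
        batch.append(url)
    for item in deferred:
        heapq.heappush(queue, item)
    return batch


queue = []
for item in [
    (0.1, "http://a.example/1"),
    (0.2, "http://a.example/2"),
    (0.3, "http://a.example/3"),
    (0.4, "http://b.example/1"),
    (0.5, "http://c.example/1"),
]:
    heapq.heappush(queue, item)

first = next_batch(queue, max_hosts=2, max_requests_per_host=2)
second = next_batch(queue, max_hosts=2, max_requests_per_host=2)
print(first)   # ['http://a.example/1', 'http://a.example/2', 'http://b.example/1']
print(second)  # ['http://a.example/3', 'http://c.example/1']
```

The sketch scans the whole queue for every batch, which is acceptable for a toy example but would not scale; a production queue would need to cap the scan or partition the data by host.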

The next significant milestone in Frontera's development was the introduction of the crawling strategy and the strategy worker, along with the abstraction of the message bus. It became possible to code a custom crawling strategy without dealing with the low-level backend code that operates the queue. An easy way to specify which links should be scheduled, when, and with what priority made Frontera a true crawl frontier framework. Kafka was quite a heavy requirement for small crawlers, and the message bus abstraction made it possible to integrate almost any messaging system with Frontera.
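As a rough illustration of separating crawl policy from queue handling, the sketch below defines a strategy that decides which links to schedule and with what score, and a worker loop that forwards those decisions to a scheduling callback. The names `BreadthFirstStrategy`, `links_extracted`, and `strategy_worker`, as well as the scoring formula, are assumptions made for this example and do not reproduce Frontera's actual interfaces.

```python
class BreadthFirstStrategy:
    """Hypothetical policy: shallower links score higher, deep links are dropped."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.seen = set()

    def links_extracted(self, parent_depth, links):
        """Return (url, score) pairs for the links worth scheduling."""
        depth = parent_depth + 1
        if depth > self.max_depth:
            return []
        scheduled = []
        for url in links:
            if url in self.seen:
                continue  # schedule each URL at most once
            self.seen.add(url)
            scheduled.append((url, 1.0 / (depth + 1)))
        return scheduled


def strategy_worker(strategy, events, schedule):
    """Feed crawl events to the strategy and pass its decisions to the queue."""
    for parent_depth, links in events:
        for url, score in strategy.links_extracted(parent_depth, links):
            schedule(url, score)


scheduled = []
events = [
    (0, ["http://example.com/a", "http://example.com/b"]),
    (1, ["http://example.com/a", "http://example.com/c"]),
    (2, ["http://example.com/d"]),  # would be depth 3, beyond max_depth
]
strategy_worker(
    BreadthFirstStrategy(max_depth=2),
    events,
    schedule=lambda url, score: scheduled.append((url, score)),
)
print(scheduled)
```

The strategy never touches the queue directly; it only returns decisions. The `schedule` callback could therefore be backed by any storage or message bus without changing the policy code.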

See also

References