Crawlee
Developer(s) | Apify |
---|---|
Initial release | 13 July 2022 |
Written in | TypeScript, Python |
Operating system | Windows, macOS, Linux |
Type | Web crawler |
License | Apache License 2.0 |
Crawlee is a free and open-source web-crawling and browser automation library developed by Apify. The original TypeScript version was first released under the Crawlee name in 2022, and a Python version was added in 2024.
Crawlee's architecture is built around modular crawlers responsible for extracting data from websites[1]. The library follows a declarative programming approach, where users define crawling logic through a structured set of rules. Crawlee uses queues to manage requests; for each request, a specific function is executed to extract data or perform further processing[2].
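The queue-and-handler pattern described above can be illustrated with a minimal sketch. It is a toy model in plain Python, not Crawlee's actual API: the `ToyCrawler` class, the `handler` function and the in-memory `SITE` mapping are all hypothetical names chosen for illustration, and no network requests are made.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Set

# Hypothetical names throughout: this models the pattern, not Crawlee's API.
Handler = Callable[[str, "ToyCrawler"], None]


class ToyCrawler:
    """Processes URLs from a FIFO queue, calling one handler per request."""

    def __init__(self, pages: Dict[str, List[str]], handler: Handler) -> None:
        self.pages = pages            # stand-in for the web: URL -> outgoing links
        self.handler = handler        # user-defined function run for each request
        self.queue: Deque[str] = deque()
        self.seen: Set[str] = set()   # URLs already enqueued (deduplication)
        self.results: List[str] = []  # data collected by the handler

    def enqueue(self, url: str) -> None:
        """Add a URL to the queue unless it was already requested."""
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def run(self, start_urls: List[str]) -> List[str]:
        """Drain the queue, running the handler once per request."""
        for url in start_urls:
            self.enqueue(url)
        while self.queue:
            self.handler(self.queue.popleft(), self)
        return self.results


def handler(url: str, crawler: ToyCrawler) -> None:
    """Record the visited URL, then enqueue the links found on that page."""
    crawler.results.append(url)
    for link in crawler.pages.get(url, []):
        crawler.enqueue(link)


# A four-page imaginary site with a cycle between "/" and "/b".
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}

visited = ToyCrawler(SITE, handler).run(["/"])
print(visited)
```

In this sketch the set of seen URLs keeps the cycle between pages from causing repeated visits, so each page is handled exactly once.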
Crawlee supports both headless browser sessions (via Playwright and other browser automation software) and plain HTTP request-based scraping.
It also provides various web-scraping utilities, such as a sitemap parser[3] and an automatic HTTP proxy manager.
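As a rough illustration of what a sitemap parser does, the following sketch extracts page URLs from a sitemap document using only the Python standard library. It is not Crawlee's implementation: the `parse_sitemap` function and the example document are assumptions made for illustration, and the sketch handles only a plain `urlset` sitemap rather than sitemap indexes or compressed files.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> List[str]:
    """Return the <loc> URL of every <url> entry in a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content used only for this example.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

urls = parse_sitemap(EXAMPLE)
print(urls)
```

The resulting list of URLs could then serve as the starting requests for a crawl.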
Notable projects that use Crawlee include GPT Crawler by Builder.io[4] and various generative AI projects maintained by AWS Labs[5].
History
The first stable TypeScript version was released in 2021 under the name Apify SDK[6]. This version offered both the open-source crawling framework and the proprietary storage implementation for use on the Apify platform.
In 2022, version 3.0.0 was released[7], renaming the library to Crawlee. This update made Crawlee independent of the Apify platform by moving most of the Apify-specific features into a separate package, which kept the name Apify SDK.
In 2024, a beta version of Crawlee for Python was released[8].
References
- Koekemoer, Jakkie. "Web Scraping with Crawlee: Step-By-Step Tutorial". https://brightdata.com/blog/web-data/web-scraping-with-crawlee.
- Nechytailo, Yelyzaveta. "Crawlee Tutorial: Easy Web Scraping and Browser Automation". https://oxylabs.io/blog/crawlee-web-scraping-tutorial.
- "Release v3.7.0 · apify/crawlee". https://github.com/apify/crawlee/releases/tag/v3.7.0.
- "BuilderIO/gpt-crawler: Crawl a site to generate knowledge files to create your own custom GPT from a URL". https://github.com/BuilderIO/gpt-crawler.
- "awslabs/generative-ai-cdk-constructs: AWS Generative AI CDK Constructs are sample implementations of AWS CDK for common generative AI patterns". Amazon Web Services - Labs. 20 September 2024. https://github.com/awslabs/generative-ai-cdk-constructs.
- "Release v1.0.0 · apify/crawlee". https://github.com/apify/crawlee/releases/tag/v1.0.0.
- "Release v3.0.0 · apify/crawlee". https://github.com/apify/crawlee/releases/tag/v3.0.0.
- "Announcing Crawlee for Python: Now you can use Python to build reliable web crawlers | Crawlee · Build reliable crawlers. Fast.". 5 July 2024. https://crawlee.dev/blog/launching-crawlee-python.
Original source: https://en.wikipedia.org/wiki/Crawlee.