Draft:Crawlee

{{AFC submission|d|v|u=Jindrich.bar|ns=118|decliner=Reading Beans|declinets=20241027140437|ts=20240926095241}}

{{Short description|TypeScript/Python web-crawling library}}

{{Draft topics|internet-culture|software|computing|technology}}

{{AfC topic|stem}}

{{Infobox software

| name = Crawlee

| logo = File:Crawlee.svg

| screenshot =

| caption =

| collapsible =

| author =

| developer = Apify

| released = {{Start date|2022|07|13|df=yes}}

| discontinued =

| latest release version = {{wikidata|property|reference|edit|P348}}

| latest release date = {{start date and age|{{wikidata|qualifier|P348|P577}}}}

| latest preview version =

| latest preview date =

| programming language = TypeScript, Python

| operating system = Windows, macOS, Linux

| platform =

| size =

| language =

| genre = Web crawler

| license = Apache License 2.0

}}

Crawlee is a free and open-source web-crawling and browser automation library developed by Apify. The TypeScript version was released under the Crawlee name in 2022, and a Python version followed in 2024.

Crawlee's architecture is built around modular crawlers responsible for extracting data from websites.{{cite web |last1=Koekemoer |first1=Jakkie |title=Web Scraping with Crawlee: Step-By-Step Tutorial |url=https://brightdata.com/blog/web-data/web-scraping-with-crawlee |website=Bright Data}} The library follows a declarative programming approach, in which users define crawling logic through a structured set of rules. Crawlee uses queues to manage requests; for each request, a specific function is executed to extract data or perform further processing.{{cite web |last1=Nechytailo |first1=Yelyzaveta |title=Crawlee Tutorial: Easy Web Scraping and Browser Automation |url=https://oxylabs.io/blog/crawlee-web-scraping-tutorial |website=oxylabs.io |language=en}}
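The queue-and-handler pattern described above can be illustrated with a minimal sketch in plain Python. It is not Crawlee's API: the names <code>crawl</code>, <code>link_handler</code> and <code>FAKE_SITE</code> are hypothetical, and an in-memory dictionary stands in for real web pages so that the example runs without network access.

```python
from collections import deque
from typing import Callable, Dict, List

# Hypothetical in-memory "website": each URL maps to the links found on that page.
# This stands in for real HTTP responses and is an assumption of the sketch.
FAKE_SITE: Dict[str, List[str]] = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/"],
}

Enqueue = Callable[[str], None]
Handler = Callable[[str, Enqueue], None]


def crawl(start_url: str, handler: Handler) -> List[str]:
    """Process URLs from a FIFO request queue, calling `handler` once per request."""
    queue = deque([start_url])
    seen = {start_url}          # URLs already queued, to avoid repeat requests
    visited: List[str] = []

    def enqueue(url: str) -> None:
        # Only queue URLs that have not been seen before.
        if url not in seen:
            seen.add(url)
            queue.append(url)

    while queue:
        url = queue.popleft()
        handler(url, enqueue)   # the handler may add further requests
        visited.append(url)
    return visited


def link_handler(url: str, enqueue: Enqueue) -> None:
    """Example handler: 'extract' links from the fake page and queue them."""
    for link in FAKE_SITE.get(url, []):
        enqueue(link)


visited = crawl("https://example.com/", link_handler)
print(visited)
```

In this sketch the handler receives the current URL and a callback for adding new requests, so the crawl continues until the queue is empty and each URL is processed once.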

Crawlee supports both headless browser sessions (via Playwright and other browser automation software) and plain HTTP request-based scraping.

It also provides web-scraping utilities, such as a sitemap parser{{cite web |title=Release v3.7.0 · apify/crawlee |url=https://github.com/apify/crawlee/releases/tag/v3.7.0 |website=GitHub |access-date=22 September 2024 |language=en}} and an automatic HTTP proxy manager.

Projects that use Crawlee include GPT Crawler by Builder.io{{cite web |title=BuilderIO/gpt-crawler: Crawl a site to generate knowledge files to create your own custom GPT from a URL |url=https://github.com/BuilderIO/gpt-crawler |website=GitHub |access-date=21 September 2024}} and various generative AI projects maintained by AWS Labs.{{cite web |title=awslabs/generative-ai-cdk-constructs: AWS Generative AI CDK Constructs are sample implementations of AWS CDK for common generative AI patterns. |url=https://github.com/awslabs/generative-ai-cdk-constructs |website=GitHub |publisher=Amazon Web Services - Labs |access-date=21 September 2024 |date=20 September 2024}}

History

The first stable TypeScript version was released in 2021 under the name Apify SDK.{{cite web |title=Release v1.0.0 · apify/crawlee |url=https://github.com/apify/crawlee/releases/tag/v1.0.0 |website=GitHub |language=en}} This version included both the open-source crawling framework and the proprietary storage implementation for use on the Apify platform.

In 2022, version 3.0.0 was released,{{cite web |title=Release v3.0.0 · apify/crawlee |url=https://github.com/apify/crawlee/releases/tag/v3.0.0 |website=GitHub |language=en}} renaming the library to Crawlee. This update made Crawlee independent of the Apify platform by moving most of the Apify-specific features into a separate package, which kept the name Apify SDK.

In 2024, a beta version of Crawlee for Python was released.{{cite web |title=Announcing Crawlee for Python: Now you can use Python to build reliable web crawlers {{!}} Crawlee · Build reliable crawlers. Fast. |url=https://crawlee.dev/blog/launching-crawlee-python |website=crawlee.dev |language=en |date=5 July 2024}}

References

{{reflist}}