Ferret vs Scrapy

ferret: Apache-2.0; 49; 7; 5,227; 58.1 thousand (month); Jul 27 2019; v0.17.0 (16 days ago)
scrapy: 46,263; 30; 745; BSD; 2.8.0 (22 days ago); Jul 26 2019; 948.3 thousand (month)

Ferret is a web scraping system. It aims to simplify data extraction from the web for UI testing, machine learning, analytics, and more. Ferret lets users focus on the data: its own declarative language abstracts away the technical details and complexity of the underlying technologies. It is designed to be portable, extensible, and fast.

Features

  • Declarative language
  • Support of both static and dynamic web pages
  • Embeddable
  • Extensible

Ferret is written in Go and can also be used from Python through pyfer.

Scrapy is an open-source Python library for web scraping. It allows developers to extract structured data from websites using a simple and consistent interface.

Scrapy provides:

  • A built-in way to follow links and extract data from multiple pages (crawling)
  • Built-in handling of common web scraping tasks such as logging in, managing cookies, and following redirects

Scrapy is built on top of the Twisted networking engine, which provides a non-blocking way to handle multiple requests at the same time, allowing Scrapy to efficiently scrape large websites.
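The non-blocking idea can be sketched with Python's standard library. This is only an illustration: it uses asyncio instead of Twisted, makes no real requests, and the names fake_fetch and crawl are hypothetical. Five simulated requests wait at the same time, so the total time stays close to one delay rather than the sum of five.

```python
import asyncio
import time


async def fake_fetch(url, delay):
    # asyncio.sleep stands in for network latency; no real request is made.
    await asyncio.sleep(delay)
    return f"fetched {url}"


async def crawl(urls, delay=0.1):
    # Schedule every fetch at once so they all wait concurrently.
    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(fake_fetch(u, delay) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(5)]
start = time.perf_counter()
results = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start
print(len(results), "pages in", round(elapsed, 2), "seconds")
```

Run one after another, the same five waits would take about half a second. Scrapy applies the same principle to real HTTP requests through Twisted.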

It also comes with a built-in mechanism for handling common web scraping problems, such as:

  • handling HTTP errors
  • handling broken links

Scrapy also provides these features:

  • Support for storing scraped data in various formats, such as CSV, JSON, and XML.
  • Built-in support for selecting and extracting data using XPath or CSS selectors (through parsel).
  • Built-in support for handling common web scraping problems (like deduplication and URL filtering).
  • Ability to easily extend its functionality using middlewares.
  • Ability to easily extend output processing using pipelines.
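As a rough sketch of the pipeline extension point: a Scrapy item pipeline is a plain Python class with a process_item(item, spider) method that receives each scraped item and returns it. The class name CleanItemPipeline and the "title" and "url" fields below are assumptions made for this example, not part of Scrapy.

```python
from urllib.parse import urldefrag


class CleanItemPipeline:
    """Hypothetical pipeline: tidies titles and strips URL fragments."""

    def process_item(self, item, spider):
        # Scrapy calls this method once for every item a spider yields.
        title = item.get("title")
        if title is not None:
            # Collapse runs of whitespace into single spaces.
            item["title"] = " ".join(title.split())
        url = item.get("url")
        if url is not None:
            # Drop the "#fragment" part so equivalent URLs compare equal.
            item["url"] = urldefrag(url).url
        return item


# Quick check outside Scrapy, with a plain dict standing in for an item.
pipeline = CleanItemPipeline()
cleaned = pipeline.process_item(
    {"title": "  Hello \n  world ", "url": "https://example.com/a#top"},
    spider=None,
)
print(cleaned)
```

In a real project, a pipeline like this would be enabled through the ITEM_PIPELINES setting, which maps the class's import path (project-specific, so not shown here) to an integer that sets its order.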

Highlights


popular, css-selectors, xpath-selectors, community-tools, output-pipelines, middlewares, async, production, large-scale

Example Use


// Example scraper for Google in Ferret:
LET google = DOCUMENT("https://www.google.com/", {
    driver: "cdp",
    userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
})

HOVER(google, 'input[name="q"]')
WAIT(RAND(100))
INPUT(google, 'input[name="q"]', @criteria, 30)
WAIT(RAND(100))
CLICK(google, 'input[name="btnK"]')

WAITFOR EVENT "navigation" IN google

WAIT_ELEMENT(google, "#res")

LET results = ELEMENTS(google, X("//*[text() = 'Search Results']/following-sibling::*/*"))

FOR el IN results
    RETURN {
        title: INNER_TEXT(el, 'h3')?,
        description: INNER_TEXT(el, X("//em/parent::*")),
        url: ELEMENT(el, 'a')?.attributes.href
    }

Alternatives / Similar