pholcus vs scrapy

Pholcus: Apache-2.0 | 7 | 1 | 7,284 | Feb 15 2020 | v1.3.4 (3 years ago)
Scrapy: 46,263 | 30 | 745 | BSD | 2.8.0 (22 days ago) | Jul 26 2019 | 948.3 thousand (month)

Pholcus is a minimalistic web crawler library written in the Go programming language. It is designed to be flexible and easy to use, and it supports concurrent, distributed, and modular crawling.
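
To make "concurrent crawling" concrete, here is a minimal sketch in plain Go that uses only the standard library. It does not use the Pholcus API: `crawl`, `fetchFunc`, and the in-memory `site` map are hypothetical names chosen for illustration, and the fetch step is faked so the example runs without network access.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// fetchFunc is a hypothetical stand-in for an HTTP fetch; it returns the
// links found on a page. It is not part of Pholcus.
type fetchFunc func(url string) []string

// crawl visits every page reachable from start, running at most `workers`
// fetches at a time, and returns the visited URLs in sorted order.
func crawl(start string, fetch fetchFunc, workers int) []string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		visited = map[string]bool{start: true}
		sem     = make(chan struct{}, workers) // limits concurrent fetches
	)

	var visit func(url string)
	visit = func(url string) {
		defer wg.Done()
		sem <- struct{}{} // acquire a worker slot
		links := fetch(url)
		<-sem // release the slot before scheduling children

		for _, link := range links {
			mu.Lock()
			seen := visited[link]
			visited[link] = true
			mu.Unlock()
			if !seen {
				wg.Add(1)
				go visit(link)
			}
		}
	}

	wg.Add(1)
	go visit(start)
	wg.Wait()

	urls := make([]string, 0, len(visited))
	for u := range visited {
		urls = append(urls, u)
	}
	sort.Strings(urls)
	return urls
}

func main() {
	// An in-memory "site" so the sketch runs without network access.
	site := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/b", "/c"},
		"/b": {"/"},
		"/c": nil,
	}
	fetch := func(url string) []string { return site[url] }
	fmt.Println(crawl("/", fetch, 2))
}
```

The buffered channel acts as a semaphore that caps the number of simultaneous fetches, while the mutex-protected `visited` map keeps each page from being scheduled twice.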

Note that Pholcus is documented and maintained in Chinese and has no English resources other than the source code itself.

Scrapy is an open-source Python library for web scraping. It allows developers to extract structured data from websites using a simple and consistent interface.

Scrapy provides:

  • A built-in way to follow links and extract data from multiple pages (crawling)
  • Built-in handling of common web scraping tasks such as logging in, managing cookies, and following redirects.

Scrapy is built on top of the Twisted networking engine, which provides a non-blocking way to handle multiple requests at the same time, allowing Scrapy to efficiently scrape large websites.

It also comes with a built-in mechanism for handling common web scraping problems, such as:

  • handling HTTP errors
  • handling broken links
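
As an illustration of what handling HTTP errors and broken links can involve, the following is a minimal sketch in Go, the language of the example further down. It is not Scrapy's API and does not reproduce Scrapy's actual retry policy: `fetchWithRetry`, `statusFunc`, and `errBrokenLink` are hypothetical names, and the rule assumed here (retry anything that is neither 2xx nor 4xx, treat 4xx as a broken link) is a simplification.

```go
package main

import (
	"errors"
	"fmt"
)

// statusFunc is a hypothetical stand-in for an HTTP request; it returns
// the response status code for a URL.
type statusFunc func(url string) int

// errBrokenLink marks URLs that answered with a client error (4xx).
var errBrokenLink = errors.New("broken link")

// fetchWithRetry returns on the first 2xx status, reports 4xx statuses as
// broken links without retrying, and retries any other status up to
// maxRetries times (so at most 1+maxRetries attempts in total).
func fetchWithRetry(url string, get statusFunc, maxRetries int) (attempts int, err error) {
	for attempts = 1; ; attempts++ {
		status := get(url)
		switch {
		case status >= 200 && status < 300:
			return attempts, nil
		case status >= 400 && status < 500:
			return attempts, fmt.Errorf("%w: %s returned %d", errBrokenLink, url, status)
		case attempts > maxRetries:
			return attempts, fmt.Errorf("giving up on %s after %d attempts (last status %d)", url, attempts, status)
		}
	}
}

func main() {
	// A fake endpoint that fails twice with 503 and then succeeds.
	calls := 0
	flaky := func(string) int {
		calls++
		if calls < 3 {
			return 503
		}
		return 200
	}
	n, err := fetchWithRetry("/flaky", flaky, 3)
	fmt.Println(n, err)

	// A fake endpoint that always answers 404 is reported as a broken link.
	_, err = fetchWithRetry("/missing", func(string) int { return 404 }, 3)
	fmt.Println(errors.Is(err, errBrokenLink))
}
```

Separating "retry later" from "give up now" keeps a crawler from hammering pages that will never resolve while still recovering from transient server failures.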

Scrapy also provides these features:

  • Support for storing scraped data in various formats, such as CSV, JSON, and XML.
  • Built-in support for selecting and extracting data using XPath or CSS selectors (through parsel).
  • Built-in support for handling common web scraping problems (like deduplication and URL filtering).
  • Ability to easily extend its functionality using middlewares.
  • Ability to easily extend output processing using pipelines.
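
The deduplication and pipeline ideas in the list above can be sketched in a few lines of Go. This is an illustration of the general pattern, not Scrapy's implementation or API: `Item`, `stage`, `dedupe`, `trimTitle`, and `run` are hypothetical names, and JSON is used as the output format only because it is one of the formats mentioned.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Item is a hypothetical scraped record.
type Item struct {
	URL   string `json:"url"`
	Title string `json:"title"`
}

// stage is one step of an output pipeline; returning false drops the item.
type stage func(Item) (Item, bool)

// dedupe returns a stage that drops items whose URL was already seen.
func dedupe() stage {
	seen := map[string]bool{}
	return func(it Item) (Item, bool) {
		if seen[it.URL] {
			return it, false
		}
		seen[it.URL] = true
		return it, true
	}
}

// trimTitle removes leading and trailing whitespace from the title.
func trimTitle(it Item) (Item, bool) {
	it.Title = strings.TrimSpace(it.Title)
	return it, true
}

// run passes every item through the stages in order and keeps the survivors.
func run(items []Item, stages ...stage) []Item {
	var out []Item
	for _, it := range items {
		keep := true
		for _, s := range stages {
			if it, keep = s(it); !keep {
				break
			}
		}
		if keep {
			out = append(out, it)
		}
	}
	return out
}

func main() {
	items := []Item{
		{URL: "https://www.example.com/a", Title: "  First  "},
		{URL: "https://www.example.com/a", Title: "Duplicate"},
		{URL: "https://www.example.com/b", Title: "Second"},
	}
	out, err := json.Marshal(run(items, dedupe(), trimTitle))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Each stage is an ordinary function, so adding a validation or storage step means appending one more function to the `run` call rather than changing the crawl logic.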

Highlights


popular, css-selectors, xpath-selectors, community-tools, output-pipelines, middlewares, async, production, large-scale

Example Use


package main

import (
    "github.com/PuerkitoBio/goquery" // provides goquery.Document used by the callback
    "github.com/henrylee2cn/pholcus/exec"
    _ "github.com/henrylee2cn/pholcus/spider/standard" // standard spider
)

func main() {
    // create spider object
    spider := exec.NewSpider(exec.NewTask("demo", "https://www.example.com"))
    // add a callback for URL route by regex pattern. In this case it's any route:
    spider.AddRule(".*", "Parse")
    // Start spider
    spider.Start()
}

// define callback here
func Parse(self *exec.Spider, doc *goquery.Document) {
    // callbacks receive the spider and a reference to the parsed HTML document
}

Alternatives / Similar