dataflowkit vs ayakashi

BSD-3-Clause 1 3 571
Feb 09 2017 2023-02-22(2 days ago)
160 1 8 AGPL-3.0-only
1.0.0-beta8.3(4 months ago) Apr 18 2019 159 (month)

Dataflow kit ("DFK") is a web scraping framework for Gophers. It extracts data from web pages according to the specified CSS selectors. You can use it for data mining, data processing, or archiving.

A web-scraping pipeline consists of three general components:

  • Downloading an HTML web page (Fetch Service).
  • Parsing the HTML page and retrieving the data we're interested in (Parse Service).
  • Encoding the parsed data to CSV, MS Excel, JSON, JSON Lines or XML format.
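The encoding stage can be illustrated with a minimal sketch. It assumes the parse stage yields an array of flat records and covers only the JSON, JSON Lines and CSV targets. The names `encodeRecords` and `toCsvField` are hypothetical and are not part of Dataflow kit's API.

```javascript
// Hypothetical helpers for illustration only; not Dataflow kit's API.

// Quote a CSV field when it contains a comma, a double quote, or a line break,
// doubling any embedded double quotes (RFC 4180 style).
function toCsvField(value) {
  const text = String(value ?? "");
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Encode an array of flat records into one of the supported text formats.
function encodeRecords(records, format) {
  switch (format) {
    case "json":
      return JSON.stringify(records);
    case "jsonl":
      // JSON Lines: one JSON document per line.
      return records.map((record) => JSON.stringify(record)).join("\n");
    case "csv": {
      if (records.length === 0) return "";
      // Assumption: every record has the same keys as the first one.
      const columns = Object.keys(records[0]);
      const lines = [columns.map(toCsvField).join(",")];
      for (const record of records) {
        lines.push(columns.map((column) => toCsvField(record[column])).join(","));
      }
      return lines.join("\r\n");
    }
    default:
      throw new Error(`unsupported format: ${format}`);
  }
}
```

Keeping the encoder separate from fetching and parsing means the same parsed records can be written in several formats without scraping the page again.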

For fetching, Dataflow kit provides two types of page fetchers:

  • Base fetcher uses the standard Go HTTP client to fetch pages as is. It is faster than Chrome fetcher but cannot render dynamic, JavaScript-driven web pages.
  • Chrome fetcher is intended for rendering dynamic, JavaScript-based content. It sends requests to Chrome running in headless mode.

For parsing, Dataflow kit extracts data from the downloaded web page following the rules listed in a JSON configuration file. Extracted data is returned in CSV, MS Excel, JSON or XML format.

Some dataflowkit features:

  • Scraping of JavaScript generated pages;
  • Data extraction from paginated websites;
  • Processing infinite scrolled pages;
  • Scraping of websites behind login form;
  • Cookies and sessions handling;
  • Following links and detailed pages processing;
  • Managing delays between requests per domain;
  • Following robots.txt directives;
  • Saving intermediate data in Diskv or MongoDB. The storage interface is flexible enough to add more storage types easily;
  • Encoding results to CSV, MS Excel, JSON (Lines), XML formats;
  • Dataflow kit is fast. It takes about 4-6 seconds to fetch and then parse 50 pages.
  • Dataflow kit is suitable for processing quite large volumes of data. Our tests show that parsing approximately 4 million pages takes about 7 hours.
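To make the "delays between requests per domain" feature concrete, here is a minimal sketch of one way such a delay could be tracked. It is not Dataflow kit's implementation. The class `DomainThrottle`, its `reserve` method and the fixed delay are all assumptions made for illustration.

```javascript
// Hypothetical per-domain throttle; not Dataflow kit's actual mechanism.
class DomainThrottle {
  constructor(delayMs) {
    this.delayMs = delayMs;
    // hostname -> earliest timestamp (ms) at which the next request may start
    this.nextAllowed = new Map();
  }

  // Returns how many milliseconds the caller should wait before requesting
  // `url`, and reserves that time slot for the URL's host.
  reserve(url, now = Date.now()) {
    const host = new URL(url).hostname;
    const earliest = this.nextAllowed.get(host) ?? now;
    const start = Math.max(now, earliest);
    this.nextAllowed.set(host, start + this.delayMs);
    return start - now;
  }
}
```

A caller would sleep for the returned number of milliseconds before issuing the request. Requests to different hosts do not delay each other, because each host keeps its own schedule.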

Ayakashi is a web scraping library for Node.js that allows developers to easily extract structured data from websites. It is built on top of the popular "puppeteer" library and provides a simple and intuitive API for defining and querying the structure of a website.

Features:

  • Powerful querying and data models
    Ayakashi finds things in the page and uses them through props and domQL. Directly inspired by the relational database world (and SQL), domQL makes DOM access easy and readable no matter how obscure the page's structure is. Props package domQL expressions as reusable structures, which can then be passed around to actions or used as models for data extraction.
  • High-level built-in actions
    Ready-made actions so you can focus on what matters. Easily handle infinite scrolling, single page navigation, events and more. Plus, you can always build your own actions, either from scratch or by composing other actions.
  • Preload code on pages
    Need to include a bunch of code, a library you made or a 3rd party module and make it available on a page? Preloaders have you covered.

Example Use


Dataflow kit is driven by a JSON configuration such as:
{
  "name": "collection",
  "request": {
      "url": "https://example.com"
  },
  "fields": [
      {
          "name": "Title",
          "selector": ".product-container a",
          "extractor": {
              "types": [
                  "text",
                  "href"
              ],
              "filters": [
                  "trim",
                  "lowerCase"
              ],
              "params": {
                  "includeIfEmpty": false
              }
          }
      },
      {
          "name": "Image",
          "selector": "#product-container img",
          "extractor": {
              "types": [
                  "alt",
                  "src",
                  "width",
                  "height"
              ],
              "filters": [
                  "trim",
                  "upperCase"
              ]
          }
      },
      {
          "name": "Buyinfo",
          "selector": ".buy-info",
          "extractor": {
              "types": [
                  "text"
              ],
              "params": {
                  "includeIfEmpty": false
              }
          }
      }
  ],
  "paginator": {
      "selector": ".next",
      "attr": "href",
      "maxPages": 3
  },
  "format": "json",
  "fetcherType": "chrome",
  "paginateResults": false
}
This configuration is then passed to Dataflow kit through a CLI command.
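Before handing such a file to the CLI, it can help to sanity-check its shape. The sketch below is a hypothetical checker, not Dataflow kit's own validation. The set of keys it treats as required (`name`, `request.url`, and a `selector` plus `extractor.types` for each field) is an assumption drawn only from the example configuration above.

```javascript
// Hypothetical checker; the "required" keys are assumptions based on the
// example configuration, not on Dataflow kit's real validation rules.
function findConfigProblems(config) {
  const problems = [];
  if (typeof config.name !== "string" || config.name === "") {
    problems.push("name is missing");
  }
  if (!config.request || typeof config.request.url !== "string") {
    problems.push("request.url is missing");
  }
  const fields = Array.isArray(config.fields) ? config.fields : [];
  if (fields.length === 0) {
    problems.push("fields is empty");
  }
  fields.forEach((field, index) => {
    if (!field.selector) {
      problems.push(`fields[${index}] has no selector`);
    }
    const types = field.extractor && field.extractor.types;
    if (!Array.isArray(types) || types.length === 0) {
      problems.push(`fields[${index}] has no extractor types`);
    }
  });
  return problems;
}
```

An empty result means the configuration has the same overall shape as the example. It says nothing about whether the selectors actually match anything on the target page.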
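Ayakashi, by contrast, is scripted in JavaScript:
const ayakashi = require("ayakashi");

// `await` is only valid inside an async function in a CommonJS module
async function main() {
    const myAyakashi = ayakashi.init();

    // navigate the browser to the target page
    await myAyakashi.goTo("https://example.com/product");

    // parse the HTML:
    // first, define a prop that selects elements with the class "product-item"
    myAyakashi
        .select("productList")
        .where({class: {eq: "product-item"}});

    // then extract data from the current page using that prop
    const productList = await myAyakashi.extract("productList");
    console.log(productList);
}

main().catch(console.error);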

Alternatives / Similar