roach vs phpscraper

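roach: MIT license, 1,121 stars, latest version 2.0.1 (6 days ago); other figures: 13, 1, 134 (month), Dec 27 2021

phpscraper: GPL-3.0-or-later license, 355 stars, latest version 1.0.2 (2 months ago); other figures: 18, 2, 112 (month), May 04 2020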

Roach is a complete web scraping toolkit for PHP. It is heavily inspired by the popular Scrapy package for Python.

Roach allows us to define spiders that crawl and scrape web documents. Roach isn’t just a simple crawler, but includes an entire pipeline to clean, persist and otherwise process extracted data as well.

Just like Scrapy, Roach supports:

Middlewares
Item pipelines
Extensibility through plugins

It’s your all-in-one resource for web scraping in PHP.
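The pipeline idea described above (clean, validate, then keep or drop each extracted item) can be sketched in plain PHP. This is only an illustration under stated assumptions: it does not use Roach's own classes, and `ItemProcessor`, `TrimFields`, `DropEmptyTitle`, and `runPipeline` are hypothetical names invented for this example. Each processor receives a scraped item as an associative array and returns either the processed item or null to drop it.

```php
<?php

declare(strict_types=1);

// Hypothetical names for illustration only; this is not Roach's API.
interface ItemProcessor
{
    /** Return the processed item, or null to drop it from the pipeline. */
    public function process(array $item): ?array;
}

// Cleaning step: trim whitespace from every string field.
final class TrimFields implements ItemProcessor
{
    public function process(array $item): ?array
    {
        return array_map(
            static fn ($value) => is_string($value) ? trim($value) : $value,
            $item
        );
    }
}

// Validation step: drop items that have no usable title.
final class DropEmptyTitle implements ItemProcessor
{
    public function process(array $item): ?array
    {
        return ($item['title'] ?? '') === '' ? null : $item;
    }
}

/**
 * Pass each item through the processors in order and keep the survivors.
 *
 * @param iterable<array> $items
 * @param ItemProcessor[] $processors
 * @return array[]
 */
function runPipeline(iterable $items, array $processors): array
{
    $kept = [];
    foreach ($items as $item) {
        foreach ($processors as $processor) {
            $item = $processor->process($item);
            if ($item === null) {
                continue 2; // item was dropped; move on to the next one
            }
        }
        $kept[] = $item;
    }
    return $kept;
}

// Made-up sample data standing in for items yielded by a spider.
$scraped = [
    ['title' => '  Spiders ', 'subtitle' => ' Defining spiders '],
    ['title' => '   ', 'subtitle' => 'No title here'],
];

$clean = runPipeline($scraped, [new TrimFields(), new DropEmptyTitle()]);
print_r($clean);
```

Keeping each step in its own small class means steps can be reordered or swapped independently, which is the general appeal of a pipeline design; a real toolkit would typically add persistence as a further step.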

PHPScraper is a universal web-util for PHP. Its main goal is to get stuff done without getting distracted by selectors or by preparing and converting data structures: you just go to a website and get the information relevant to your project.

PHPScraper is a minimalistic scraper framework that is built on top of other popular scraping tools.

Features:

Direct access to basic page features such as meta data, links, images, headings, content, and keywords.
File downloading.
RSS, Sitemap and other feed processing.
CSV, XML and JSON file processing.

Example Use

<?php

use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class RoachDocsSpider extends BasicSpider
{
    /**
     * @var string[]
     */
    public array $startUrls = [
        'https://roach-php.dev/docs/spiders'
    ];

    public function parse(Response $response): \Generator
    {
        $title = $response->filter('h1')->text();

        $subtitle = $response
            ->filter('main > div:nth-child(2) p:first-of-type')
            ->text();

        yield $this->item([
            'title' => $title,
            'subtitle' => $subtitle,
        ]);
    }
}

// create scraper object
$web = new \Spekulatius\PHPScraper\PHPScraper;
// go to URL
$web->go('https://test-pages.phpscraper.de/content/selectors.html');

// elements can be found using XPath:
echo $web->filter("//*[@id='by-id']")->text();   // "Content by ID"

// or pre-defined variables covering basic page data:
$web->links;  // for all links
$web->headings;
$web->images;
$web->contentKeywords;
$web->orderedLists;
$web->unorderedLists;
$web->paragraphs;
$web->outline;  // basic page outline
$web->cleanOutlineWithParagraphs;  // page outline including paragraphs

Alternatives / Similar

colly 18,958
pholcus 7,284
geziyor 1,857
dataflowkit 571
scrapy 46,263
rvest 1,387
ferret 5,227
gocrawl 2,008
node-crawler 6,334
scrapyd 2,598
panther 2,691
autoscraper 4,911
gracy 156
scrapydweb 2,671
wombat 1,281
spidr 744
ralger 145
ruia 1,642
photon 9,417
gerapy 2,914
ayakashi 160
phpscraper 355
php-spider 1,286
dude 341
crwlr-crawler 128