crwlr-crawlervsphpscraper
This library provides kind of a framework and a lot of ready to use, so-called steps, that you can use as building blocks, to build your own crawlers and scrapers with.
Some features: - Crawler Politeness innocent (respecting robots.txt, throttling,...) - Load URLs using - a (PSR-18) HTTP client (default is of course Guzzle) - or a headless browser (chrome) to get source after Javascript execution - Get absolute links from HTML documents link - Get sitemaps from robots.txt and get all URLs from those sitemaps - Crawl (load) all pages of a website spider - Use cookies (or don't) cookie - Use any HTTP methods (GET, POST,...) and send any headers or body - Iterate over paginated list pages repeat - Extract data from: - HTML and also XML (using CSS selectors or XPath queries) - JSON (using dot notation) - CSV (map columns) - Extract schema.org structured data in JSON-LD format from HTML documents - Keep memory usage low by using PHP Generators muscle - Cache HTTP responses during development, so you don't have to load pages again and again after every code change - Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)
PHPScraper is a universal web-util for PHP. The main goal is to get stuff done instead of getting distracted with selectors, preparing & converting data structures, etc. Instead, you can just go to a website and get the relevant information for your project.
PHPScraper is a minimalistic scraper framework that is built on top of other popular scraping tools.
Features:
- Direct access to page basic features like: Meta data, Links, Images, Headings, Content, Keywords etc.
- File downloading.
- RSS, Sitemap and other feed processing.
- CSV, XML and JSON file processing.
Example Use
<?php
require_once 'vendor/autoload.php';
use Crwlr\Crawler;
$crawler = new Crawler();
$crawler->get('https://example.com', ['User-Agent' => 'webscraping.fyi']);
// more links can be followed:
$crawler->followLinks();
// and current page can be parsed:
$response = $crawler->response();
$title = $crawler->filter('title')->text();
echo $response->getContent();
</div>
<div class="lib-example" markdown>
```javascript
// create scraper object
$web = new \Spekulatius\PHPScraper\PHPScraper;
// go to URL
$web->go('https://test-pages.phpscraper.de/content/selectors.html');
// elements can be found using XPath:
echo $web->filter("//*[@id='by-id']")->text();   // "Content by ID"
// or pre-defined variables covering basic page data:
$web->links;  // for all links
$web->headings;
$web->images;
$web->contentKeywords;
$web->orderedLists;
$web->unorderedLists;
$web->paragraphs;
$web->outline;  // basic page outline
$web->cleanOutlineWithParagraphs;  // basic page outline