Skip to content

ralgervscrwlr-crawler

MIT 3 1 145
1.1 thousand (month) Dec 22 2019 2.2.4(2 years ago)
128 1 - MIT
v1.0.0(15 days ago) Apr 18 2022 1 (month)

ralger is a small web scraping framework for R based on rvest and xml2.

It's goal to simplify basic web scraping and it provides a convenient and easy to use API.

It offers functions for retrieving pages, parsing HTML using CSS selectors, automatic table parsing and auto link, title, image and paragraph extraction.

This library provides kind of a framework and a lot of ready to use, so-called steps, that you can use as building blocks, to build your own crawlers and scrapers with.

Some features: - Crawler Politeness innocent (respecting robots.txt, throttling,...) - Load URLs using - a (PSR-18) HTTP client (default is of course Guzzle) - or a headless browser (chrome) to get source after Javascript execution - Get absolute links from HTML documents link - Get sitemaps from robots.txt and get all URLs from those sitemaps - Crawl (load) all pages of a website spider - Use cookies (or don't) cookie - Use any HTTP methods (GET, POST,...) and send any headers or body - Iterate over paginated list pages repeat - Extract data from: - HTML and also XML (using CSS selectors or XPath queries) - JSON (using dot notation) - CSV (map columns) - Extract schema.org structured data in JSON-LD format from HTML documents - Keep memory usage low by using PHP Generators muscle - Cache HTTP responses during development, so you don't have to load pages again and again after every code change - Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)

Example Use


library("ralger")

url <- "http://www.shanghairanking.com/rankings/arwu/2021"

# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#>  [1] "Harvard University"
#>  [2] "Stanford University"
#>  [3] "University of Cambridge"
#>  [4] "Massachusetts Institute of Technology (MIT)"
#>  [5] "University of California, Berkeley"

# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a", # the a tag
  attr = "class" # getting the class attribute
)

head(attributes, 10) # NA values are a tags without a class attribute
#>  [1] "navbar-brand logo" "nav-link"          NA
#>  [4] NA                  NA                  "nav-link"
#>  [7] NA                  "nav-link"          NA
#> [10] NA
#

# ralger can automatically scrape tables:
data <- table_scrap(link ="https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")

head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      `Lifetime Gross`  Year
#>   <int> <chr>                                      <chr>            <int>
#> 1     1 Avatar                                     $2,847,397,339    2009
#> 2     2 Avengers: Endgame                          $2,797,501,328    2019
#> 3     3 Titanic                                    $2,201,647,264    1997
#> 4     4 Star Wars: Episode VII - The Force Awakens $2,069,521,700    2015
#> 5     5 Avengers: Infinity War                     $2,048,359,754    2018
#> 6     6 Spider-Man: No Way Home                    $1,901,216,740    2021
<?php
require_once 'vendor/autoload.php';

use Crwlr\Crawler;

$crawler = new Crawler();
$crawler->get('https://example.com', ['User-Agent' => 'webscraping.fyi']);


// more links can be followed:
$crawler->followLinks();

// and current page can be parsed:
$response = $crawler->response();
$title = $crawler->filter('title')->text();
echo $response->getContent();
```

Alternatives / Similar