R-bloggers.com

Web Scraping Exercises

2016-12-20

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

[Before proceeding with these exercises, first read the rvest package help and the SelectorGadget help.]

Answers to the exercises are available here.

Exercise 1

Consider the url ‘http://statbel.fgov.be/en/statistics/figures/economy/indicators/prix_prod_con/’

Extract all the information in the table ‘Third Quarter 2016’.
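As a hint, here is a minimal sketch of reading an HTML table with rvest. It runs on a small inline stand-in document rather than the real page, so the table contents and the assumption that the wanted table is the first one on the page are illustrative only.

```r
library(rvest)

# Hypothetical stand-in for the page; the real markup and values will differ.
page <- read_html('<html><body>
  <table>
    <tr><th>Indicator</th><th>Value</th></tr>
    <tr><td>A</td><td>1.5</td></tr>
    <tr><td>B</td><td>2.0</td></tr>
  </table>
</body></html>')

# html_table() on a set of <table> nodes returns a list, one entry per table.
tables <- html_table(html_nodes(page, "table"))

# Assumption: the table of interest is the first one found.
q3 <- tables[[1]]
print(q3)
```

On the live page, replace the inline string with the url and use SelectorGadget to find a selector that picks out only the table you need.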

Exercise 2

Consider the url ‘http://www2.sas.com/proceedings/sugi30/toc.html’

Extract all the paper names, from 001-30 to 268-30.

Exercise 3

Consider the url ‘http://www.gibbon.se/Retailer/Map.aspx?SectionId=832’

Extract all the options (countries) available in the select drop-down.
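A possible starting point: the options of a drop-down are usually `<option>` nodes inside a `<select>` node. The sketch below uses an inline stand-in document; the `id` and the two countries are made up for illustration and are not taken from the real page.

```r
library(rvest)

# Hypothetical stand-in for the page; the real <select> may have other attributes.
page <- read_html('<html><body>
  <select id="country">
    <option value="SE">Sweden</option>
    <option value="NO">Norway</option>
  </select>
</body></html>')

# Select every <option> nested in a <select> and keep its visible text.
countries <- html_text(html_nodes(page, "select option"))
print(countries)
```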

Exercise 4

Consider the url ‘http://r-exercises.com/start-here-to-learn-r/’

Extract all the topics available on the url.

Exercise 5

Consider the url ‘http://www.immobiliare.it/Roma/agenzie_immobiliari_provincia-Roma.html’

Extract the names of all the real estate agencies listed on the first page.

Exercise 6

Consider the url ‘http://www.gibbon.se/Retailer/Map.aspx?SectionId=832’.

Extract the links to the detailed information for each row of the table.

For example, for the first address, Karlbergsvägen 32, 113 27 stockholm, the details are

A.E.N HUND I STAN AB

ADRESS OCH ÖPPETTIDER

Karlbergsvägen 32

113 27 STOCKHOLM

Öppettider:

Telefon: 08-313058

Mail-adress: info@hundistan.eu

Hemsida:

The link to those details (reached by clicking on Karlbergsvägen 32, 113 27 stockholm) is http://www.gibbon.se/Retailer/Retailer.aspx?ItemId=45128.

You have to extract all the links available, one per row.
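As a hint, links live in the `href` attribute of `<a>` nodes, and relative links can be resolved against the page url. The sketch below assumes each table row holds one `<a>` with a relative `href`; that layout, the link text, and the second `ItemId` are assumptions for illustration, not facts about the real page.

```r
library(rvest)

base_url <- "http://www.gibbon.se/Retailer/Map.aspx?SectionId=832"

# Hypothetical stand-in for the table; the second ItemId is made up.
page <- read_html('<html><body><table>
  <tr><td><a href="Retailer.aspx?ItemId=45128">first address</a></td></tr>
  <tr><td><a href="Retailer.aspx?ItemId=2">second address</a></td></tr>
</table></body></html>')

# Pull the href attribute from every link inside a table row.
hrefs <- html_attr(html_nodes(page, "table tr a"), "href")

# Turn relative links into full urls, using the page url as the base.
links <- xml2::url_absolute(hrefs, base_url)
print(links)
```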

Exercise 7

Consider the url ‘https://www.bkk-klinikfinder.de/suche/suchergebnis.php?next=1’

Extract the links to the detailed information of each hospital. For example, for the hospital Krankenhaus Dresden-Friedrichstadt Städtisches Klinikum, the details are available at the link:

https://www.bkk-klinikfinder.de/krankenhaus/index.php?id=26140094900

Exercise 8

Consider the url scraped in Exercise 7.

Extract the links to ‘Details’ for each hospital displayed on the first 4 pages.
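One way to approach several pages is to build the page urls first and apply the same extraction to each. The sketch below assumes that pages 1 to 4 are reached by changing the `next` query parameter and that the wanted links are anchors whose text is "Details"; both are assumptions, and `details_links` is a hypothetical helper name.

```r
library(rvest)

# Assumption: the `next` query parameter is the page number.
base_url <- "https://www.bkk-klinikfinder.de/suche/suchergebnis.php?next="
page_urls <- paste0(base_url, 1:4)

# Hypothetical helper: return the href of every link whose text is "Details".
details_links <- function(doc) {
  anchors <- html_nodes(doc, "a")
  is_details <- html_text(anchors, trim = TRUE) == "Details"
  html_attr(anchors[is_details], "href")
}

# Offline check on a tiny stand-in document.
demo <- read_html('<html><body>
  <a href="/krankenhaus/index.php?id=26140094900">Details</a>
  <a href="/other">Home</a>
</body></html>')
demo_links <- details_links(demo)
print(demo_links)

# Live use (needs network access), left commented out:
# all_links <- unlist(lapply(page_urls, function(u) details_links(read_html(u))))
```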

Exercise 9

Consider the url ‘http://www.dictionary.com/browse/’ and the words ‘handy’, ‘whisper’, ‘lovely’, ‘scrape’.

Build a data frame, where the first variable is “Word” and the second variable is “definitions”. Scrape the definitions from the url.
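A rough sketch of the overall shape: paste each word onto the base url, extract the definition text from each page, and bind the results into a data frame. The CSS class `.def-content` and the helper `get_definition` are assumptions for illustration; check the real page with SelectorGadget. The extraction step is shown on an inline stand-in document so that it runs offline.

```r
library(rvest)

base_url <- "http://www.dictionary.com/browse/"
words <- c("handy", "whisper", "lovely", "scrape")
urls <- paste0(base_url, words)

# Hypothetical helper: the default selector is an assumption about the page.
get_definition <- function(doc, css = ".def-content") {
  defs <- html_text(html_nodes(doc, css), trim = TRUE)
  paste(defs, collapse = "; ")
}

# Offline stand-in for one dictionary page.
demo_doc <- read_html('<html><body>
  <div class="def-content"> first meaning </div>
  <div class="def-content">second meaning</div>
</body></html>')

result <- data.frame(
  Word = "demo",
  definitions = get_definition(demo_doc),
  stringsAsFactors = FALSE
)
print(result)

# Live use (needs network access), left commented out:
# defs <- vapply(urls, function(u) get_definition(read_html(u)), character(1))
# result <- data.frame(Word = words, definitions = defs, stringsAsFactors = FALSE)
```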

Exercise 10

Consider the url ‘http://www.gibbon.se/Retailer/Map.aspx?SectionId=832’.

Build a data frame with all the information available for each row.

For example, for the first address, Karlbergsvägen 32, 113 27 stockholm, the details are

A.E.N HUND I STAN AB

ADRESS OCH ÖPPETTIDER

Karlbergsvägen 32

113 27 STOCKHOLM

Öppettider:

Telefon: 08-313058

Mail-adress: info@hundistan.eu

Hemsida:

For the second row, Inedalsgatan 5, 112 33 stockholm, the details are

ARKENZOO KUNGSHOLMEN A

ADRESS OCH ÖPPETTIDER

Kungs Zoo AB

Inedalsgatan 5

112 33 STOCKHOLM

Öppettider:

Telefon: 08-7248110

Mail-adress: kungsholmen@arkenzoo.se

Hemsida: www.arkenzoo.se

These details will be saved as rows of the data.frame, one row per store:

  Website address  Name of store           Phone Number  Email address            City       Country
1                  A.E.N Hund i Stan AB    08-313058     info@hundistan.eu        Stockholm  Sweden
2 www.arkenzoo.se  ArkenZoo Kungsholmen A  08-7248110    kungsholmen@arkenzoo.se  Stockholm  Sweden
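Once the detail text of a store has been scraped, the labelled lines ("Telefon:", "Mail-adress:", "Hemsida:") can be split into fields. The sketch below works on a subset of the lines shown for the first store; `field` is a hypothetical helper, and it assumes every label sits at the start of its own line.

```r
# Subset of the detail lines for the first store, as plain text.
details <- c(
  "A.E.N HUND I STAN AB",
  "113 27 STOCKHOLM",
  "Telefon: 08-313058",
  "Mail-adress: info@hundistan.eu",
  "Hemsida:"
)

# Hypothetical helper: return the text after "label:", or NA if it is empty or absent.
field <- function(lines, label) {
  pattern <- paste0("^", label, ":")
  hit <- grep(pattern, lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  value <- trimws(sub(pattern, "", hit[1]))
  if (nzchar(value)) value else NA_character_
}

# Assumption: the store name is the first line of the detail text.
store <- data.frame(
  Website = field(details, "Hemsida"),
  Name = details[1],
  Phone = field(details, "Telefon"),
  Email = field(details, "Mail-adress"),
  stringsAsFactors = FALSE
)
print(store)
```

Applying the same steps to every detail page and combining the one-row results with rbind() gives the full data frame.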
