-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathscrape.Rmd
57 lines (44 loc) · 1.55 KB
/
scrape.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
---
title: "NHL Stanley Cup Playoff Predictions"
author: "Ilari Scheinin"
output: html_document
---
Previous: [README](https://github.com/ilarischeinin/stanley)
Next: [2. Process data](process.html)
1. Scrape raw data
------------------
[nhl.com](http://www.nhl.com) has play-by-play data available starting from
season 2002--2003, for both regular season and playoff games. To scrape it, I
am using the [nhlscrapr](https://github.com/acthomasca/nhlscrapr)
[package](http://cran.r-project.org/web/packages/nhlscrapr/). It has a single
command `compile.all.games()`, which downloads and compiles everything together.
However, it waits 20 seconds between every game, and therefore takes more than
3.5 days process all 12 seasons available. Instead, one might want to use
something like the two options below to download games one-by-one or by season,
and to set a shorter time interval.
```{r}
suppressMessages({
library(nhlscrapr)
})
compile.all.games()
```
In order to download games one-by-one or by season, these approaches can be
used.
```{r, eval=FALSE}
# get full list of games available
games <- full.game.database()
# download by game
apply(games, 1, function(game) {
download.single.game(season=game["season"], gcode=game["gcode"], wait=2)
gc()
})
# download by season
lapply(unique(games$season), function(season) {
download.games(games[games$season == season, ], wait=2)
gc()
})
# and once downloaded, compile everything together
compile.all.games()
```
Next: [2. Process data](process.html)
Previous: [README](https://github.com/ilarischeinin/stanley)