This course was created by Gregor Wiedemann and Andreas Niekler. It was freely released under GPLv3 in September 2017. If you use (parts of) it for your own teaching or analysis, please cite it.
Tutorial_1_Web_scraping.Rmd (+9 -8)

@@ -1,8 +1,10 @@
 ---
-title: "Tutorial 2: Web crawling and scraping"
+title: "Tutorial 1: Web crawling and scraping"
 author: "Andreas Niekler, Gregor Wiedemann"
 date: "`r format(Sys.time(), '%Y-%m-%d')`"
 output:
+  pdf_document:
+    toc: yes
   html_document:
     toc: true
     theme: united
@@ -17,15 +19,15 @@ klippy::klippy()
 ```
 This tutorial covers how to extract and process text data from web pages or other documents for later analysis.
 The automated download of HTML pages is called **Crawling**. The extraction of the textual data and/or metadata (for example, article date, headlines, author names, article text) from the HTML source code (or the DOM, the document object model of the website) is called **Scraping**. For these tasks, we use the package "rvest".
-In a third exercise, we will extract text data from various formats such as PDF, DOC, DOCX and TXT files with the "readtext" package.
 
 1. Download a single web page and extract its content
-2. Extract links from a overview page and extract articles
-3. Extract text data from PDF and other formats on disk
+2. Extract links from an overview page
+3. Extract all articles corresponding to the links from step 2
+
 
 # Preparation
 
-Create a new R script (File -> New File -> R Script) named "Tutorial_2.R". In this script you will enter and execute all commands. If you want to run the complete script in RStudio, you can use Ctrl-A to select the complete source code and execute it with Ctrl-Return. If you want to execute only one line, simply press Ctrl-Return on that line. If you want to execute a block of several lines, select the block and press Ctrl-Return.
+Create a new R script (File -> New File -> R Script) named "Tutorial_1.R". In this script you will enter and execute all commands. If you want to run the complete script in RStudio, you can use Ctrl-A to select the complete source code and execute it with Ctrl-Return. If you want to execute only one line, simply press Ctrl-Return on that line. If you want to execute a block of several lines, select the block and press Ctrl-Return.
 
 Tip: Copy individual sections of the source code directly into the console and run them step by step. Get familiar with the function calls using the built-in Help.
 
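As a minimal illustration of the crawling and scraping steps described in the hunk above, the following sketch downloads a single static page with rvest and extracts the headline and paragraph texts. The URL and the CSS selectors are placeholder assumptions, not taken from the tutorial:

```r
# Minimal rvest sketch (URL and selectors are placeholder assumptions).
library(rvest)

url <- "https://example.org/article"   # placeholder URL
html_document <- read_html(url)        # crawling: download and parse the page

# scraping: extract identifiable elements from the parsed document
headline <- html_document %>%
  html_node("h1") %>%                  # first <h1> element (assumed selector)
  html_text(trim = TRUE)

paragraphs <- html_document %>%
  html_nodes("p") %>%                  # all <p> elements
  html_text(trim = TRUE)
```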
@@ -39,7 +41,7 @@ options(stringsAsFactors = F)
 getwd()
 ```
 
-# Prepare scraping of dynamic web pages
+# Scraping of dynamic web pages
 
 Modern websites often do not contain the full content displayed in the browser in their corresponding source files, which are served by the webserver. Instead, the browser loads additional content dynamically via javascript code contained in the original source file. To be able to scrape such content, we rely on the headless browser "phantomJS", which renders a site for a given URL for us before we start the actual scraping, i.e. the extraction of certain identifiable elements from the rendered site.
 
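The rendering step itself is not shown in this hunk. A rough sketch of how it can look with the `webdriver` package and a local PhantomJS installation (the package choice and URL here are assumptions, not confirmed by the diff):

```r
# Hedged sketch: render a javascript-heavy page with PhantomJS before scraping.
# Assumes the "webdriver" R package and a PhantomJS binary are installed.
library(webdriver)
library(rvest)

pjs_instance <- run_phantomjs()                # start the headless browser
pjs_session <- Session$new(port = pjs_instance$port)

url <- "https://example.org/dynamic-page"      # placeholder URL
pjs_session$go(url)                            # load the page and run its javascript
rendered_source <- pjs_session$getSource()     # HTML after dynamic rendering
html_document <- read_html(rendered_source)    # parse the rendered source with rvest
```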
@@ -75,8 +77,6 @@ A convenient method to download and parse a webpage provides the function `read_
 
 To make sure that we get the dynamically rendered HTML content of the website, we pass the original source code downloaded from the URL to our PhantomJS session first, and then use the rendered source.
 
-*NOTICE*: In case the website does not fetch or alter the to-be-scraped content dynamically, you can omit the PhantomJS webdriver and just download the static HTML source code to retrieve the information from there. In this case, replace the following block of code with a simple call of `html_document <- read_html(url)`, where the `read_html()` function downloads the page source for you.
+*NOTICE*: In case the website does not fetch or alter the to-be-scraped content dynamically, you can omit the PhantomJS webdriver and just download the static HTML source code to retrieve the information from there. In this case, replace the following block of code with a simple call of `html_document <- read_html(url)`, where the `read_html()` function downloads the unrendered page source code directly.
Tutorial_2_Read_textdata.Rmd (+2 -2)

@@ -1,5 +1,5 @@
 ---
-title: 'Tutorial 1: Processing of textual data'
+title: 'Tutorial 2: Processing of textual data'
 author: "Andreas Niekler, Gregor Wiedemann"
 date: "`r format(Sys.time(), '%Y-%m-%d')`"
 output:
@@ -23,7 +23,7 @@ In this tutorial, we demonstrate how to read text data in R, tokenize texts and
 2. From text to a corpus,
 3. Create a document-term matrix and investigate Zipf's law
 
-First, let's create a new R Project (File -> New Project -> Existing directory) in the provided tutorial folder. Then we create a new R File (File -> New File -> R script) and save it as "Tutorial_1.R".
+First, let's create a new R Project (File -> New Project -> Existing directory) in the provided tutorial folder. Then we create a new R File (File -> New File -> R script) and save it as "Tutorial_2.R".
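The corpus and document-term-matrix steps listed above are not part of this diff. A generic sketch with the tm package (the package choice is an assumption; the tutorial's own code is not visible here) shows the idea:

```r
# Generic corpus -> document-term matrix sketch (tm package is an assumption).
library(tm)

texts <- c("First example document.", "A second, slightly longer example document.")
corpus <- VCorpus(VectorSource(texts))   # from text to a corpus
dtm <- DocumentTermMatrix(corpus)        # create a document-term matrix

# term frequencies sorted in decreasing order: the starting point for Zipf's law
term_freqs <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
```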
-1. Run the time series analysis with the terms "environment", "climate", "planet", "space".
+1. Create the time series plot with the terms "environment", "climate", "planet", "space" as shown above. Then, try to use the ggplot2 library for the line plot (e.g. the function `geom_line()`).
+counts_per_decade <- aggregate(DTM_reduced, by = list(decade = textdata$decade), sum)
+
+# ggplot2 version
+df <- melt(counts_per_decade, id.vars = "decade")
+ggplot(data = df, aes(x = decade, y = value, group = variable, color = variable)) +
+  geom_line()
+
+```
+
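The added solution chunk relies on objects and packages defined earlier in the (collapsed) file: `DTM_reduced`, `textdata`, plus reshape2 for `melt()` and ggplot2. A self-contained toy version, with all counts invented, would look like this:

```r
# Self-contained toy version of the added solution; all counts are invented.
library(reshape2)   # melt(): wide table -> long format
library(ggplot2)

counts_per_decade <- data.frame(
  decade      = c(1790, 1800, 1810, 1820),
  environment = c(0, 2, 1, 3),
  climate     = c(1, 0, 3, 2)
)

# one row per (decade, term) pair, as ggplot2 expects
df <- melt(counts_per_decade, id.vars = "decade")
ggplot(data = df, aes(x = decade, y = value, group = variable, color = variable)) +
  geom_line()
```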
 2. Use a different relative measure for the sentiment analysis: instead of computing the proportion of positive/negative terms with regard to all terms, compute the share of positive/negative terms with regard to all sentiment terms only.
+ggplot(data = df, aes(x = reorder(president, value, head, 1), y = value, fill = variable)) + geom_bar(stat = "identity", position = "stack") + coord_flip()
+```
+
+
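For exercise 2, the changed denominator can be sketched in a few lines (all names and counts here are invented):

```r
# Relative sentiment measure with sentiment terms as denominator (toy data).
positive_counts <- c(12, 5, 9)    # positive terms per document (invented)
negative_counts <- c(3, 10, 6)    # negative terms per document (invented)
sentiment_terms <- positive_counts + negative_counts

share_positive <- positive_counts / sentiment_terms  # instead of dividing by all terms
share_negative <- negative_counts / sentiment_terms
```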
 3. The AFINN sentiment lexicon provides not only negative/positive terms, but also a valence value for each term ranging from [-5;+5]. Instead of counting sentiment terms only, use the valence values for sentiment scoring.
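A hedged sketch of valence-based scoring for exercise 3 (the lexicon object and its columns are assumptions; real AFINN data would be loaded from the lexicon file):

```r
# Valence-based sentiment scoring with an AFINN-style lexicon (toy data).
tokens <- c("good", "bad", "awful", "great", "table")
afinn  <- data.frame(word  = c("good", "bad", "awful", "great"),
                     value = c(3, -3, -3, 3))

idx <- match(tokens, afinn$word)                        # NA for non-sentiment terms
sentiment_score <- sum(afinn$value[idx], na.rm = TRUE)  # sum of valences: here 0
```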
-4. Draw a `heatmap` of the terms "i", "you", "he", "she", "we", "they" aggregated per president. Caution: you need to modify the preprocessing in a certain way!
+ggplot(data = df, aes(x = reorder(president, value, head, 1), y = value, fill = variable)) + geom_bar(stat = "identity", position = "stack") + coord_flip()
+```
+
+
+
+4. Draw a heatmap of the terms "i", "you", "he", "she", "we", "they" aggregated per president. Caution: you need to modify the preprocessing in a certain way! Also consider setting the parameter `scale='none'` when calling the `heatmap` function.
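For the heatmap exercise, a toy sketch with invented counts (the real matrix would come from the aggregated document-term matrix): since the listed pronouns are typical stopwords, stopword removal presumably has to be disabled during preprocessing, and `scale = "none"` keeps raw frequencies instead of row-wise z-scores.

```r
# Toy heatmap: pronoun frequencies per president (all numbers invented).
m <- matrix(c(10, 4, 2, 7, 1, 5,
               8, 6, 3, 2, 9, 4),
            nrow = 6,
            dimnames = list(c("i", "you", "he", "she", "we", "they"),
                            c("Washington", "Lincoln")))
heatmap(m, scale = "none")   # scale = "none": plot raw counts, no per-row scaling
```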