Commit 2428c66

Gregor Wiedemann authored and committed

version 2020

1 parent 071c79d · commit 2428c66


48 files changed: +3785 -1934 lines

README.md

+17-3
@@ -4,12 +4,12 @@ This course consists of 8 tutorials written in R-markdown and further described

 You can use *knitr* to create the tutorial sheets as HTML notebooks from the [R-markdown source code](https://github.com/tm4ss/tm4ss.github.io).

-In the /docs folder, you have access to the **[rendered tutorials](https://tm4ss.github.io/docs)**.
+In the `/docs` folder, you have access to the **[rendered tutorials](https://tm4ss.github.io/docs)**.

 ## Tutorials

-1. Data import and web scraping
-2. Text as data
+1. Web crawling and scraping
+2. Text data import in R
 3. Frequency analysis
 4. Key term extraction
 5. Co-occurrence analysis
@@ -19,6 +19,20 @@ In the /docs folder, you have access to the **[rendered tutorials](https://tm4ss

 Click **[here for the rendered tutorials](https://tm4ss.github.io/docs)**.

+## Render from source
+
+Clone the repository
+
+```
+git clone https://github.com/tm4ss/tm4ss.github.io.git
+```
+
+Open the `Tutorials.Rproj` R-project file and run
+
+```
+rmarkdown::render_site(output_format = "html_document")
+```
+
 ## License & Citation

 This course was created by Gregor Wiedemann and Andreas Niekler. It was freely released under GPLv3 in September 2017. If you use (parts of) it for your own teaching or analysis, please cite
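To render only a single tutorial sheet rather than the whole site, `rmarkdown::render()` can be pointed at one file. A minimal sketch, assuming the working directory is the repository root; the file name is taken from the rename in this commit:

```r
# Render one tutorial to HTML instead of building the whole site
rmarkdown::render("Tutorial_1_Web_scraping.Rmd", output_format = "html_document")
```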

Tutorial_2_Web_crawling.Rmd renamed to Tutorial_1_Web_scraping.Rmd

+9-8
@@ -1,8 +1,10 @@
 ---
-title: "Tutorial 2: Web crawling and scraping"
+title: "Tutorial 1: Web crawling and scraping"
 author: "Andreas Niekler, Gregor Wiedemann"
 date: "`r format(Sys.time(), '%Y-%m-%d')`"
 output:
+  pdf_document:
+    toc: yes
   html_document:
     toc: true
     theme: united
@@ -17,15 +19,15 @@ klippy::klippy()
 ```
 This tutorial covers how to extract and process text data from web pages or other documents for later analysis.
 The automated download of HTML pages is called **Crawling**. The extraction of the textual data and/or metadata (for example, article date, headlines, author names, article text) from the HTML source code (or the DOM, the document object model of the website) is called **Scraping**. For these tasks, we use the package "rvest".
-In a third exercise, we will extract text data from various formats such as PDF, DOC, DOCX and TXT files with the "readtext" package.

 1. Download a single web page and extract its content
-2. Extract links from a overview page and extract articles
-3. Extract text data from PDF and other formats on disk
+2. Extract links from an overview page
+3. Extract all articles corresponding to the links from step 2
+

 # Preparation

-Create a new R script (File -> New File -> R Script) named "Tutorial_2.R". In this script you will enter and execute all commands. If you want to run the complete script in RStudio, you can use Ctrl-A to select the complete source code and execute with Ctrl-Return. If you want to execute only one line, you can simply press Ctrl-Return on the respective line. If you want to execute a block of several lines, select the block and press Ctrl-Return.
+Create a new R script (File -> New File -> R Script) named "Tutorial_1.R". In this script you will enter and execute all commands. If you want to run the complete script in RStudio, you can use Ctrl-A to select the complete source code and execute with Ctrl-Return. If you want to execute only one line, you can simply press Ctrl-Return on the respective line. If you want to execute a block of several lines, select the block and press Ctrl-Return.

 Tip: Copy individual sections of the source code directly into the console (2) and run it step by step. Get familiar with the function calls included in the Help function.

@@ -39,7 +41,7 @@ options(stringsAsFactors = F)
 getwd()
 ```

-# Prepare scraping of dynamic web pages
+# Scraping of dynamic web pages

 Modern websites often do not contain the full content displayed in the browser in their corresponding source files which are served by the webserver. Instead, the browser loads additional content dynamically via javascript code contained in the original source file. To be able to scrape such content, we rely on a headless browser "phantomJS" which renders a site for a given URL for us, before we start the actual scraping, i.e. the extraction of certain identifiable elements from the rendered site.
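The hunks below use a `pjs_session` object to render such dynamic pages. A minimal setup sketch, assuming the `webdriver` package and an installed PhantomJS binary; the object names are chosen to mirror the tutorial's `pjs_session`:

```r
# Start a headless PhantomJS process and open a session for rendering dynamic pages.
# Assumes install.packages("webdriver") and webdriver::install_phantomjs() were run once.
library(webdriver)
pjs_instance <- run_phantomjs()                        # launch the PhantomJS process
pjs_session <- Session$new(port = pjs_instance$port)   # used below as pjs_session$go(url)
```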

@@ -75,8 +77,6 @@ A convenient method to download and parse a webpage provides the function `read_

 To make sure that we get the dynamically rendered HTML content of the website, we pass the original source code downloaded from the URL to our PhantomJS session first, and then use the rendered source.

-*NOTICE*: In case the website does not fetch or alter the to-be-scraped content dynamically, you can omit the PhantomJS webdriver and just download the the static HTML source code to retrieve the information from there. In this case, replace the following block of code with a simple call of `html_document <- read_html(url)` where the `read_html()` function downloads the page source for you.
-
 ```{r}
 # load URL to phantomJS session
 pjs_session$go(url)
@@ -86,6 +86,7 @@ rendered_source <- pjs_session$getSource()
 html_document <- read_html(rendered_source)
 ```

+*NOTICE*: In case the website does not fetch or alter the to-be-scraped content dynamically, you can omit the PhantomJS webdriver and just download the static HTML source code to retrieve the information from there. In this case, replace the preceding block of code with a simple call of `html_document <- read_html(url)`, where the `read_html()` function downloads the unrendered page source code directly.

 ## Scrape information from XHTML
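For the static alternative described in the notice, a minimal rvest sketch; the URL and CSS selector are placeholders, not taken from the tutorial:

```r
# Download and parse a static page without PhantomJS, then extract one element
library(rvest)
url <- "https://www.example.com/some-article"   # placeholder URL
html_document <- read_html(url)                 # fetches the unrendered page source
title_text <- html_document %>%
  html_node("h1") %>%                           # placeholder CSS selector
  html_text(trim = TRUE)
```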

Tutorial_1_Read_textdata.Rmd renamed to Tutorial_2_Read_textdata.Rmd

+2-2
@@ -1,5 +1,5 @@
 ---
-title: 'Tutorial 1: Processing of textual data'
+title: 'Tutorial 2: Processing of textual data'
 author: "Andreas Niekler, Gregor Wiedemann"
 date: "`r format(Sys.time(), '%Y-%m-%d')`"
 output:
@@ -23,7 +23,7 @@ In this tutorial, we demonstrate how to read text data in R, tokenize texts and
 2. From text to a corpus,
 3. Create a document-term matrix and investigate Zipf's law

-First, let's create a new R Project (File -> New Project -> Existing directory) in the provided tutorial folder. Then we create a new R File (File -> New File -> R script) and save it as "Tutorial_1.R".
+First, let's create a new R Project (File -> New Project -> Existing directory) in the provided tutorial folder. Then we create a new R File (File -> New File -> R script) and save it as "Tutorial_2.R".

 # Reading txt, pdf, html, docx, ...
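Since this renamed tutorial covers reading txt, pdf, html and docx files, a minimal `readtext` sketch may be useful; the `data/` path and glob pattern are placeholders, and the package infers the file format from the extension:

```r
# Read a folder of mixed-format documents into a data.frame with doc_id and text columns
library(readtext)
extracted_texts <- readtext("data/*")
head(extracted_texts)
```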

Tutorial_3_Frequency.Rmd

+120-4
@@ -182,10 +182,10 @@ The standard output is sorted by president's names alphabetically. We can make u

 ```{r buildTS6, warning=F}
 # order by positive sentiments
-ggplot(data = df, aes(x = reorder(president, df$value, head, 1), y = value, fill = variable)) + geom_bar(stat="identity", position=position_dodge()) + coord_flip()
+ggplot(data = df, aes(x = reorder(president, value, head, 1), y = value, fill = variable)) + geom_bar(stat="identity", position=position_dodge()) + coord_flip()

 # order by negative sentiments
-ggplot(data = df, aes(x = reorder(president, df$value, tail, 1), y = value, fill = variable)) + geom_bar(stat="identity", position=position_dodge()) + coord_flip()
+ggplot(data = df, aes(x = reorder(president, value, tail, 1), y = value, fill = variable)) + geom_bar(stat="identity", position=position_dodge()) + coord_flip()
 ```

 # Heatmaps
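The change above only swaps `df$value` for the bare column name inside `aes()`; the reordering logic is unchanged. A standalone toy illustration (hypothetical values, not tutorial data) of what `reorder(president, value, head, 1)` does:

```r
# reorder() re-levels a factor by applying FUN to each group's values;
# head(..., 1) picks the first value per group (the "positive" share in the melted data),
# tail(..., 1) picks the last one (the "negative" share).
toy <- data.frame(
  president = rep(c("A", "B", "C"), each = 2),
  variable  = rep(c("positive", "negative"), times = 3),
  value     = c(0.6, 0.4, 0.8, 0.2, 0.5, 0.5)
)
levels(reorder(toy$president, toy$value, head, 1))  # "C" "A" "B": ascending positive share
```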
@@ -207,9 +207,125 @@ heatmap(t(DTM_reduced), Colv=NA, col = rev(heat.colors(256)), keep.dendro= FALSE

 # Optional exercises

-1. Run the time series analysis with the terms "environment", "climate", "planet", "space".
+1. Create the time series plot with the terms "environment", "climate", "planet", "space" as shown above. Then, try to use the ggplot2 library for the line plot (e.g. the function `geom_line()`).
+
+```{r ex1, echo=F, results='hide', message=FALSE, warning=FALSE}
+# code from above
+terms_to_observe <- c("environment", "climate", "planet", "space")
+DTM_reduced <- as.matrix(DTM[, terms_to_observe])
+counts_per_decade <- aggregate(DTM_reduced, by = list(decade = textdata$decade), sum)
+
+# ggplot2 version
+df <- melt(counts_per_decade, id.vars = "decade")
+ggplot(data = df, aes(x = decade, y = value, group = variable, color = variable)) +
+  geom_line()
+```
+
 2. Use a different relative measure for the sentiment analysis: Instead of computing the proportion of positive/negative terms regarding all terms, compute the share of positive/negative terms regarding all sentiment terms only.
+
+```{r ex2, echo=F, results='hide', message=FALSE, warning=FALSE}
+relative_sentiment_frequencies <- data.frame(
+  positive = counts_positive / (counts_positive + counts_negative),
+  negative = counts_negative / (counts_positive + counts_negative)
+)
+sentiments_per_president <- aggregate(relative_sentiment_frequencies, by = list(president = textdata$president), mean)
+df <- melt(sentiments_per_president, id.vars = "president")
+ggplot(data = df, aes(x = reorder(president, value, head, 1), y = value, fill = variable)) + geom_bar(stat="identity", position="stack") + coord_flip()
+```
+
 3. The AFINN sentiment lexicon provides not only negative/positive terms, but also a valence value for each term ranging from [-5;+5]. Instead of counting sentiment terms only, use the valence values for sentiment scoring.
-4. Draw a `heatmap` of the terms "i", "you", "he", "she", "we", "they" aggregated per president. Caution: you need to modify the preprocessing in a certain way!
+
+```{r ex3, echo=F, results='hide', message=FALSE, warning=FALSE}
+corpus_afinn <- sotu_corpus %>%
+  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
+  tokens_tolower() %>%
+  tokens_remove(pattern = stopwords())
+
+# AFINN sentiment lexicon by Nielsen 2011
+afinn_terms <- read.csv("data/AFINN-111.txt", header = F, sep = "\t")
+
+pos_idx <- afinn_terms$V2 > 0
+positive_terms_score <- afinn_terms$V2[pos_idx]
+names(positive_terms_score) <- afinn_terms$V1[pos_idx]
+
+neg_idx <- afinn_terms$V2 < 0
+negative_terms_score <- afinn_terms$V2[neg_idx] * -1
+names(negative_terms_score) <- afinn_terms$V1[neg_idx]
+
+pos_DTM <- corpus_afinn %>%
+  tokens_keep(names(positive_terms_score)) %>%
+  dfm()
+positive_terms_score <- positive_terms_score[colnames(pos_DTM)]
+# caution: to multiply all rows of a matrix with a vector of ncol(matrix) length
+# you need to transpose the left matrix and then the result again
+pos_DTM <- t(t(as.matrix(pos_DTM)) * positive_terms_score)
+counts_positive <- rowSums(pos_DTM)
+
+neg_DTM <- corpus_afinn %>%
+  tokens_keep(names(negative_terms_score)) %>%
+  dfm()
+negative_terms_score <- negative_terms_score[colnames(neg_DTM)]
+# caution: to multiply all rows of a matrix with a vector of ncol(matrix) length
+# you need to transpose the left matrix and then the result again
+neg_DTM <- t(t(as.matrix(neg_DTM)) * negative_terms_score)
+counts_negative <- rowSums(neg_DTM)
+
+counts_all_terms <- corpus_afinn %>% dfm() %>% rowSums()
+
+relative_sentiment_frequencies <- data.frame(
+  positive = counts_positive / (counts_positive + counts_negative),
+  negative = counts_negative / (counts_positive + counts_negative)
+)
+
+sentiments_per_president <- aggregate(
+  relative_sentiment_frequencies,
+  by = list(president = textdata$president),
+  mean)
+
+head(sentiments_per_president)
+
+df <- melt(sentiments_per_president, id.vars = "president")
+# order by positive sentiments
+ggplot(data = df, aes(x = reorder(president, value, head, 1), y = value, fill = variable)) + geom_bar(stat="identity", position="stack") + coord_flip()
+```
+
+4. Draw a heatmap of the terms "i", "you", "he", "she", "we", "they" aggregated per president. Caution: you need to modify the preprocessing in a certain way! Also consider setting the parameter `scale='none'` when calling the `heatmap` function.
+
+```{r ex4, echo=F, results='hide', message=FALSE, warning=FALSE}
+# do not use stop word removal!
+DTM <- sotu_corpus %>%
+  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
+  tokens_tolower() %>%
+  dfm()
+
+# aggregate relative counts per president
+terms_to_observe <- c("i", "you", "he", "she", "we", "they")
+DTM_reduced <- as.matrix(DTM[, terms_to_observe])
+abs_counts_per_president <- aggregate(
+  DTM_reduced,
+  by = list(president = textdata$president),
+  sum)
+lengths_speeches_per_president <- aggregate(
+  rowSums(DTM),
+  by = list(president = textdata$president),
+  sum)
+rel_counts_per_president <- abs_counts_per_president[, -1] / lengths_speeches_per_president[, -1]
+rownames(rel_counts_per_president) <- abs_counts_per_president$president
+
+# temporal re-ordering
+temporally_ordered_presidents <- unique(textdata$president)
+rel_counts_per_president <- rel_counts_per_president[temporally_ordered_presidents, ]
+
+# plot
+heatmap(t(rel_counts_per_president), Colv=NA, col = rev(heat.colors(256)),
+        keep.dendro= FALSE, margins = c(5, 10), scale = "none")
+```

 # References
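The transpose trick used twice in exercise 3, `t(t(M) * v)`, is easy to get wrong, so here is a tiny standalone illustration (hypothetical numbers) of why it scales the columns of a matrix by a vector of length `ncol(M)`:

```r
# "*" recycles a vector down the rows of a matrix, so to scale each COLUMN j of M by v[j]
# we transpose, multiply, and transpose back
M <- matrix(1, nrow = 2, ncol = 3)
v <- c(1, 10, 100)
t(t(M) * v)
#      [,1] [,2] [,3]
# [1,]    1   10  100
# [2,]    1   10  100
```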

Tutorial_4_Term_extraction.Rmd

+2-2
@@ -93,7 +93,7 @@ Let us compute TF-IDF weights for all terms in the first speech of Barack Obama.
 # Compute IDF: log(N / n_i)
 number_of_docs <- nrow(DTM)
 term_in_docs <- colSums(DTM > 0)
-idf <- log2(number_of_docs / term_in_docs)
+idf <- log(number_of_docs / term_in_docs)

 # Compute TF
 first_obama_speech <- which(textdata$president == "Barack Obama")[1]
@@ -303,7 +303,7 @@ for (president in presidents) {
 source("calculateLogLikelihood.R")

 frq <- sort(colSums(targetDTM), decreasing = T)[1:25]
-tfidf <- sort(colSums(targetDTM) * log2(nrow(targetDTM) / colSums(targetDTM > 0)), decreasing = T)[1:25]
+tfidf <- sort(colSums(targetDTM) * log(nrow(targetDTM) / colSums(targetDTM > 0)), decreasing = T)[1:25]
 ll <- sort(calculateLogLikelihood(colSums(targetDTM), colSums(comparisonDTM)), decreasing = T)[1:25]

 df <- data.frame(
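Both replacements above switch the IDF from `log2()` to the natural `log()`. Since `log2(x) = log(x) / log(2)`, this rescales every IDF weight by the same constant factor, so TF-IDF rankings are unaffected; a quick check with hypothetical counts:

```r
# The ratio between the two IDF variants is the constant 1 / log(2) for every term
number_of_docs <- 233
term_in_docs <- c(5, 50, 200)
log2(number_of_docs / term_in_docs) / log(number_of_docs / term_in_docs)
# [1] 1.442695 1.442695 1.442695
```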

0 commit comments
