Commit 2428c66

Gregor Wiedemann authored and committed

version 2020

1 parent 071c79d · commit 2428c66


48 files changed: +3785 -1934 lines

README.md

+17-3
@@ -4,12 +4,12 @@ This course consists of 8 tutorials written in R-markdown and further described

 You can use *knitr* to create the tutorial sheets as HTML notebooks from the [R-markdown source code](https://github.com/tm4ss/tm4ss.github.io).

-In the /docs folder, you have access to the **[rendered tutorials](https://tm4ss.github.io/docs)**.
+In the `/docs` folder, you have access to the **[rendered tutorials](https://tm4ss.github.io/docs)**.

 ## Tutorials

-1. Data import and web scraping
-2. Text as data
+1. Web crawling and scraping
+2. Text data import in R
 3. Frequency analysis
 4. Key term extraction
 5. Co-occurrence analysis
@@ -19,6 +19,20 @@ In the /docs folder, you have access to the **[rendered tutorials](https://tm4ss

 Click **[here for the rendered tutorials](https://tm4ss.github.io/docs)**.

+## Render from source
+
+Clone the repository
+
+```
+git clone https://github.com/tm4ss/tm4ss.github.io.git
+```
+
+Open the `Tutorials.Rproj` R-project file and run
+
+```
+rmarkdown::render_site(output_format = "html_document")
+```
+
 ## License & Citation

 This course was created by Gregor Wiedemann and Andreas Niekler. It was freely released under GPLv3 in September 2017. If you use (parts of) it for your own teaching or analysis, please cite
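To render only a single tutorial sheet rather than the whole site, `rmarkdown::render()` can be pointed at one file. A minimal sketch, assuming the working directory is the repository root; the file name is taken from the rename in this commit:

```r
# Render one tutorial to HTML instead of building the whole site
rmarkdown::render("Tutorial_1_Web_scraping.Rmd", output_format = "html_document")
```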

Tutorial_2_Web_crawling.Rmd renamed to Tutorial_1_Web_scraping.Rmd

+9-8
@@ -1,8 +1,10 @@
 ---
-title: "Tutorial 2: Web crawling and scraping"
+title: "Tutorial 1: Web crawling and scraping"
 author: "Andreas Niekler, Gregor Wiedemann"
 date: "`r format(Sys.time(), '%Y-%m-%d')`"
 output:
+  pdf_document:
+    toc: yes
   html_document:
     toc: true
     theme: united
@@ -17,15 +19,15 @@ klippy::klippy()
 ```
 This tutorial covers how to extract and process text data from web pages or other documents for later analysis.
 The automated download of HTML pages is called **Crawling**. The extraction of the textual data and/or metadata (for example, article date, headlines, author names, article text) from the HTML source code (or the DOM, the document object model of the website) is called **Scraping**. For these tasks, we use the package "rvest".
-In a third exercise, we will extract text data from various formats such as PDF, DOC, DOCX and TXT files with the "readtext" package.

 1. Download a single web page and extract its content
-2. Extract links from a overview page and extract articles
-3. Extract text data from PDF and other formats on disk
+2. Extract links from an overview page
+3. Extract all articles corresponding to the links from step 2
+

 # Preparation

-Create a new R script (File -> New File -> R Script) named "Tutorial_2.R". In this script you will enter and execute all commands. If you want to run the complete script in RStudio, you can use Ctrl-A to select the complete source code and execute with Ctrl-Return. If you want to execute only one line, you can simply press Ctrl-Return on the respective line. If you want to execute a block of several lines, select the block and press Ctrl-Return.
+Create a new R script (File -> New File -> R Script) named "Tutorial_1.R". In this script you will enter and execute all commands. If you want to run the complete script in RStudio, you can use Ctrl-A to select the complete source code and execute with Ctrl-Return. If you want to execute only one line, you can simply press Ctrl-Return on the respective line. If you want to execute a block of several lines, select the block and press Ctrl-Return.

 Tip: Copy individual sections of the source code directly into the console (2) and run it step by step. Get familiar with the function calls included in the Help function.

@@ -39,7 +41,7 @@ options(stringsAsFactors = F)
 getwd()
 ```

-# Prepare scraping of dynamic web pages
+# Scraping of dynamic web pages

 Modern websites often do not contain the full content displayed in the browser in their corresponding source files which are served by the webserver. Instead, the browser loads additional content dynamically via javascript code contained in the original source file. To be able to scrape such content, we rely on a headless browser "phantomJS" which renders a site for a given URL for us, before we start the actual scraping, i.e. the extraction of certain identifiable elements from the rendered site.
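The hunks below use a `pjs_session` object to render such dynamic pages. A minimal setup sketch, assuming the `webdriver` package and an installed PhantomJS binary; the object names are chosen to mirror the tutorial's `pjs_session`:

```r
# Start a headless PhantomJS process and open a session for rendering dynamic pages.
# Assumes install.packages("webdriver") and webdriver::install_phantomjs() were run once.
library(webdriver)
pjs_instance <- run_phantomjs()                        # launch the PhantomJS process
pjs_session <- Session$new(port = pjs_instance$port)   # used below as pjs_session$go(url)
```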

@@ -75,8 +77,6 @@ A convenient method to download and parse a webpage provides the function `read_

 To make sure that we get the dynamically rendered HTML content of the website, we pass the original source code downloaded from the URL to our PhantomJS session first, and then use the rendered source.

-*NOTICE*: In case the website does not fetch or alter the to-be-scraped content dynamically, you can omit the PhantomJS webdriver and just download the the static HTML source code to retrieve the information from there. In this case, replace the following block of code with a simple call of `html_document <- read_html(url)` where the `read_html()` function downloads the page source for you.
-
 ```{r}
 # load URL to phantomJS session
 pjs_session$go(url)
@@ -86,6 +86,7 @@ rendered_source <- pjs_session$getSource()
 html_document <- read_html(rendered_source)
 ```

+*NOTICE*: In case the website does not fetch or alter the to-be-scraped content dynamically, you can omit the PhantomJS webdriver and just download the static HTML source code to retrieve the information from there. In this case, replace the preceding block of code with a simple call of `html_document <- read_html(url)`, where the `read_html()` function downloads the unrendered page source code directly.

 ## Scrape information from XHTML
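For the static alternative described in the notice, a minimal rvest sketch; the URL and CSS selector are placeholders, not taken from the tutorial:

```r
# Download and parse a static page without PhantomJS, then extract one element
library(rvest)
url <- "https://www.example.com/some-article"   # placeholder URL
html_document <- read_html(url)                 # fetches the unrendered page source
title_text <- html_document %>%
  html_node("h1") %>%                           # placeholder CSS selector
  html_text(trim = TRUE)
```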

Tutorial_1_Read_textdata.Rmd renamed to Tutorial_2_Read_textdata.Rmd

+2-2
@@ -1,5 +1,5 @@
 ---
-title: 'Tutorial 1: Processing of textual data'
+title: 'Tutorial 2: Processing of textual data'
 author: "Andreas Niekler, Gregor Wiedemann"
 date: "`r format(Sys.time(), '%Y-%m-%d')`"
 output:
@@ -23,7 +23,7 @@ In this tutorial, we demonstrate how to read text data in R, tokenize texts and
 2. From text to a corpus,
 3. Create a document-term matrix and investigate Zipf's law

-First, let's create a new R Project (File -> New Project -> Existing directory) in the provided tutorial folder. Then we create a new R File (File -> New File -> R script) and save it as "Tutorial_1.R".
+First, let's create a new R Project (File -> New Project -> Existing directory) in the provided tutorial folder. Then we create a new R File (File -> New File -> R script) and save it as "Tutorial_2.R".

 # Reading txt, pdf, html, docx, ...
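Since this renamed tutorial covers reading txt, pdf, html and docx files, a minimal `readtext` sketch may be useful; the `data/` path and glob pattern are placeholders, and the package infers the file format from the extension:

```r
# Read a folder of mixed-format documents into a data.frame with doc_id and text columns
library(readtext)
extracted_texts <- readtext("data/*")
head(extracted_texts)
```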

Tutorial_3_Frequency.Rmd

+120-4
@@ -182,10 +182,10 @@ The standard output is sorted by president's names alphabetically. We can make u

 ```{r buildTS6, warning=F}
 # order by positive sentiments
-ggplot(data = df, aes(x = reorder(president, df$value, head, 1), y = value, fill = variable)) + geom_bar(stat="identity", position=position_dodge()) + coord_flip()
+ggplot(data = df, aes(x = reorder(president, value, head, 1), y = value, fill = variable)) + geom_bar(stat="identity", position=position_dodge()) + coord_flip()

 # order by negative sentiments
-ggplot(data = df, aes(x = reorder(president, df$value, tail, 1), y = value, fill = variable)) + geom_bar(stat="identity", position=position_dodge()) + coord_flip()
+ggplot(data = df, aes(x = reorder(president, value, tail, 1), y = value, fill = variable)) + geom_bar(stat="identity", position=position_dodge()) + coord_flip()
 ```

 # Heatmaps
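The change above only swaps `df$value` for the bare column name inside `aes()`; the reordering logic is unchanged. A standalone toy illustration (hypothetical values, not tutorial data) of what `reorder(president, value, head, 1)` does:

```r
# reorder() re-levels a factor by applying FUN to each group's values;
# head(..., 1) picks the first value per group (the "positive" share in the melted data),
# tail(..., 1) picks the last one (the "negative" share).
toy <- data.frame(
  president = rep(c("A", "B", "C"), each = 2),
  variable  = rep(c("positive", "negative"), times = 3),
  value     = c(0.6, 0.4, 0.8, 0.2, 0.5, 0.5)
)
levels(reorder(toy$president, toy$value, head, 1))  # "C" "A" "B": ascending positive share
```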
@@ -207,9 +207,125 @@ heatmap(t(DTM_reduced), Colv=NA, col = rev(heat.colors(256)), keep.dendro= FALSE

 # Optional exercises

-1. Run the time series analysis with the terms "environment", "climate", "planet", "space".
+1. Create the time series plot with the terms "environment", "climate", "planet", "space" as shown above. Then, try to use the ggplot2 library for the line plot (e.g. the function `geom_line()`).
+
+```{r ex1, echo=F, results='hide', message=FALSE, warning=FALSE}
+# code from above
+terms_to_observe <- c("environment", "climate", "planet", "space")
+DTM_reduced <- as.matrix(DTM[, terms_to_observe])
+counts_per_decade <- aggregate(DTM_reduced, by = list(decade = textdata$decade), sum)
+
+# ggplot2 version
+df <- melt(counts_per_decade, id.vars = "decade")
+ggplot(data = df, aes(x = decade, y = value, group = variable, color = variable)) +
+  geom_line()
+```
+
 2. Use a different relative measure for the sentiment analysis: Instead of computing the proportion of positive/negative terms regarding all terms, compute the share of positive/negative terms regarding all sentiment terms only.
+
+```{r ex2, echo=F, results='hide', message=FALSE, warning=FALSE}
+relative_sentiment_frequencies <- data.frame(
+  positive = counts_positive / (counts_positive + counts_negative),
+  negative = counts_negative / (counts_positive + counts_negative)
+)
+sentiments_per_president <- aggregate(relative_sentiment_frequencies, by = list(president = textdata$president), mean)
+df <- melt(sentiments_per_president, id.vars = "president")
+ggplot(data = df, aes(x = reorder(president, value, head, 1), y = value, fill = variable)) + geom_bar(stat="identity", position="stack") + coord_flip()
+```
+
 3. The AFINN sentiment lexicon provides not only negative/positive terms, but also a valence value for each term ranging from [-5;+5]. Instead of counting sentiment terms only, use the valence values for sentiment scoring.
-4. Draw a `heatmap` of the terms "i", "you", "he", "she", "we", "they" aggregated per president. Caution: you need to modify the preprocessing in a certain way!
+
+```{r ex3, echo=F, results='hide', message=FALSE, warning=FALSE}
+corpus_afinn <- sotu_corpus %>%
+  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
+  tokens_tolower() %>%
+  tokens_remove(pattern = stopwords())
+
+# AFINN sentiment lexicon by Nielsen 2011
+afinn_terms <- read.csv("data/AFINN-111.txt", header = F, sep = "\t")
+
+pos_idx <- afinn_terms$V2 > 0
+positive_terms_score <- afinn_terms$V2[pos_idx]
+names(positive_terms_score) <- afinn_terms$V1[pos_idx]
+
+neg_idx <- afinn_terms$V2 < 0
+negative_terms_score <- afinn_terms$V2[neg_idx] * -1
+names(negative_terms_score) <- afinn_terms$V1[neg_idx]
+
+pos_DTM <- corpus_afinn %>%
+  tokens_keep(names(positive_terms_score)) %>%
+  dfm()
+positive_terms_score <- positive_terms_score[colnames(pos_DTM)]
+# caution: to multiply all rows of a matrix with a vector of ncol(matrix) length
+# you need to transpose the left matrix and then the result again
+pos_DTM <- t(t(as.matrix(pos_DTM)) * positive_terms_score)
+counts_positive <- rowSums(pos_DTM)
+
+neg_DTM <- corpus_afinn %>%
+  tokens_keep(names(negative_terms_score)) %>%
+  dfm()
+negative_terms_score <- negative_terms_score[colnames(neg_DTM)]
+# caution: to multiply all rows of a matrix with a vector of ncol(matrix) length
+# you need to transpose the left matrix and then the result again
+neg_DTM <- t(t(as.matrix(neg_DTM)) * negative_terms_score)
+counts_negative <- rowSums(neg_DTM)
+
+counts_all_terms <- corpus_afinn %>% dfm() %>% rowSums()
+
+relative_sentiment_frequencies <- data.frame(
+  positive = counts_positive / (counts_positive + counts_negative),
+  negative = counts_negative / (counts_positive + counts_negative)
+)
+
+sentiments_per_president <- aggregate(
+  relative_sentiment_frequencies,
+  by = list(president = textdata$president),
+  mean)
+
+head(sentiments_per_president)
+
+df <- melt(sentiments_per_president, id.vars = "president")
+# order by positive sentiments
+ggplot(data = df, aes(x = reorder(president, value, head, 1), y = value, fill = variable)) + geom_bar(stat="identity", position="stack") + coord_flip()
+```
+
+4. Draw a heatmap of the terms "i", "you", "he", "she", "we", "they" aggregated per president. Caution: you need to modify the preprocessing in a certain way! Also consider setting the parameter `scale='none'` when calling the `heatmap` function.
+
+```{r ex4, echo=F, results='hide', message=FALSE, warning=FALSE}
+# do not use stop word removal!
+DTM <- sotu_corpus %>%
+  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
+  tokens_tolower() %>%
+  dfm()
+
+# aggregate relative counts per president
+terms_to_observe <- c("i", "you", "he", "she", "we", "they")
+DTM_reduced <- as.matrix(DTM[, terms_to_observe])
+abs_counts_per_president <- aggregate(
+  DTM_reduced,
+  by = list(president = textdata$president),
+  sum)
+lengths_speeches_per_president <- aggregate(
+  rowSums(DTM),
+  by = list(president = textdata$president),
+  sum)
+rel_counts_per_president <- abs_counts_per_president[, -1] / lengths_speeches_per_president[, -1]
+rownames(rel_counts_per_president) <- abs_counts_per_president$president
+
+# temporal re-ordering
+temporally_ordered_presidents <- unique(textdata$president)
+rel_counts_per_president <- rel_counts_per_president[temporally_ordered_presidents, ]
+
+# plot
+heatmap(t(rel_counts_per_president), Colv=NA, col = rev(heat.colors(256)),
+        keep.dendro= FALSE, margins = c(5, 10), scale = "none")
+```

 # References
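The transpose trick used twice in exercise 3, `t(t(M) * v)`, is easy to get wrong, so here is a tiny standalone illustration (hypothetical numbers) of why it scales the columns of a matrix by a vector of length `ncol(M)`:

```r
# "*" recycles a vector down the rows of a matrix, so to scale each COLUMN j of M by v[j]
# we transpose, multiply, and transpose back
M <- matrix(1, nrow = 2, ncol = 3)
v <- c(1, 10, 100)
t(t(M) * v)
#      [,1] [,2] [,3]
# [1,]    1   10  100
# [2,]    1   10  100
```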

Tutorial_4_Term_extraction.Rmd

+2-2
@@ -93,7 +93,7 @@ Let us compute TF-IDF weights for all terms in the first speech of Barack Obama.
 # Compute IDF: log(N / n_i)
 number_of_docs <- nrow(DTM)
 term_in_docs <- colSums(DTM > 0)
-idf <- log2(number_of_docs / term_in_docs)
+idf <- log(number_of_docs / term_in_docs)

 # Compute TF
 first_obama_speech <- which(textdata$president == "Barack Obama")[1]
@@ -303,7 +303,7 @@ for (president in presidents) {
 source("calculateLogLikelihood.R")

 frq <- sort(colSums(targetDTM), decreasing = T)[1:25]
-tfidf <- sort(colSums(targetDTM) * log2(nrow(targetDTM) / colSums(targetDTM > 0)), decreasing = T)[1:25]
+tfidf <- sort(colSums(targetDTM) * log(nrow(targetDTM) / colSums(targetDTM > 0)), decreasing = T)[1:25]
 ll <- sort(calculateLogLikelihood(colSums(targetDTM), colSums(comparisonDTM)), decreasing = T)[1:25]

 df <- data.frame(
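Both replacements above switch the IDF from `log2()` to the natural `log()`. Since `log2(x) = log(x) / log(2)`, this rescales every IDF weight by the same constant factor, so TF-IDF rankings are unaffected; a quick check with hypothetical counts:

```r
# The ratio between the two IDF variants is the constant 1 / log(2) for every term
number_of_docs <- 233
term_in_docs <- c(5, 50, 200)
log2(number_of_docs / term_in_docs) / log(number_of_docs / term_in_docs)
# [1] 1.442695 1.442695 1.442695
```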

0 commit comments
