---
title: "Adam Smith Against Himself? Wealth of Nations versus Moral Sentiment: A Textual Criticism"
author: "Bill Foote"
date: "12/5/2020"
output: html_document
---
```{r , include=FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidytext)
library(tidyverse)
library(gutenbergr)
library(tidygraph)
library(corrr)
library(igraph)
library(ggraph)
library(stringr)
library(widyr)
```
## Some background
Adam Smith is famous for placing markets at the center of political economic thought and practice. Markets here are any kind of exchange for consideration. Some of these markets are for factors of production, including labor, but Smith mostly seemed to favor land as the primary factor and as an economy's capital base. Capital, that is, land, subordinates labor and thus relegates labor to a necessary but not sufficient role. Those who owned, or better yet governed, the use of capital seemed allowed, perhaps by fiat of the local prince, to use labor in any way necessary to deploy capital.
This preliminary investigation will use [Julia Silge's tidytext framework, explicated in the online version of her book.](https://www.tidytextmining.com/tidytext.html) These techniques belong in large part to the methodology of text criticism. They attempt to identify similarities and differences in versions, developments, copies, and bodies of knowledge represented in written materials called corpora. Each corpus may have had multiple authors across vast periods of time in different cultures. Text criticism attempts to unravel the history, the times, and even the sentiments embedded in a corpus. It is a tool often employed by linguists, exegetes, philologists, historians, and others to identify authors, variants and versions, and even semantic meanings and developments along a time-line and across spatial boundaries, at least as represented in textual evidence.
We can decompose a language, and its textual representation, into the atomic units of _**morphemes**_. Even in this term there are three units: _morph_, a root meaning the shape of a word; _eme_, denoting an object (a word) rather than an action; and _s_, which in English denotes the plural. All of these aspects are endemic to a textual study of bigrams, which are two-word, multiple-morpheme agglutinations of characters, and of pairs of words within a defined distance in a corpus. Semantics is a higher viewpoint than grammar, vocabulary, and syntax; it attacks the problem of meaning and interpretation, that is, hermeneutics, of what we observe in a corpus. All of these processes, objects, and concepts come into play in textual criticism. A first stab at semantics is a sentiment analysis of the meaning, and possibly the feeling and connotation, of a unit of a corpus.
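To make the bigram idea concrete, here is a minimal sketch (the sentence is made up for illustration, not drawn from either corpus) that tokenizes a short string into two-word bigrams with `tidytext::unnest_tokens()`, the same function used on the full corpora below.
```{r}
# Minimal sketch: tokenizing a made-up sentence into bigrams
toy <- tibble(text = "the division of labour increases the produce of labour")
toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
```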
## The data are the books
We use the `gutenbergr` package to access whatever books Project Gutenberg might have stored for Adam Smith.
```{r}
gutenberg_works(author == "Smith, Adam")
```
This call reveals that Project Gutenberg does not store Smith's _The Theory of Moral Sentiments_. A search finds that book [here](http://ota.ox.ac.uk/desc/3189). The code below uses [Regex](http://www.rexegg.com/regex-quickstart.html) character and digit classes, among other operators, to detect chapter headings and so manage various types of strings, and strings of strings like chapters.
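As a quick illustration of how the chapter detector behaves, here is a minimal sketch with `str_detect()` on a few made-up lines (the sample strings are illustrative, not taken from either text). A running `cumsum()` over these matches is what assigns a chapter number to every line in the chunks below; note that the Gutenberg text uses `chapter` headings while the Oxford Text Archive file uses `Chap.`, hence the two slightly different patterns.
```{r}
# Minimal sketch: the chapter-heading regex on made-up sample lines
sample_lines <- c("CHAPTER I", "Chapter iv", "the chapter on rent", "Chap. II")
str_detect(sample_lines, regex("^chapter [\\divxlc]", ignore_case = TRUE))
# cumsum() over the logical vector turns heading hits into running chapter numbers
cumsum(str_detect(sample_lines, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
```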
```{r}
wealth <- gutenberg_download(gutenberg_id = 3300) %>%
select(-gutenberg_id) %>%
mutate(chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE))),
book = as.factor("Wealth of Nations")
)
glimpse( wealth )
```
```{r}
theory <- read_file("adam-smith-moral-theory.txt") %>%
tibble(text = .) %>%
mutate(text = strsplit(text, "\n")) %>%
unnest(text) %>%
mutate(book = as.factor("The Theory of Moral Sentiments"),
chapter = cumsum(str_detect(text, regex("^Chap. [\\divxlc]",
ignore_case = TRUE))))
glimpse( theory )
```
## Mining the corpus
There are several steps we can follow here. First, and foremost for subsequent analysis, is the descriptive frequency of words in the _corpora_.
### Most used words
Here we count (frequency) the words that Smith uses the most in each of the books.
```{r}
smith_books <- rbind(theory, wealth)
smith_books_tidy <- smith_books %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
smith_books_tidy %>%
count(word, book) %>%
group_by(book) %>%
arrange(desc(n)) %>%
top_n(10) %>%
ggplot(aes(x = fct_reorder(word, n), y = n, fill = book)) +
geom_col() +
coord_flip() +
hrbrthemes::theme_ipsum_tw() +
facet_wrap(~ book, scales = "free") +
ggthemes::scale_fill_gdocs(guide = FALSE) +
labs(x = "")
```
From this simple text-critical technique we can anecdotally surmise that
1. The texts differ in the frequency of key words: price contrasts with conduct, country with persons, quantity with virtue.
2. The top ten words for _The Theory of Moral Sentiments_, perhaps naturally given its subject, relate to persons, values, and the categories of a moral, philosophical argument.
3. The top ten words for the _Wealth of Nations_ relate to categories of markets, factors of production, accounting, and economic activity.
We can go further with an _n-gram_ analysis.
```{r}
smith_bigrams <- smith_books %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
group_by(book)
# includes lots of otherwise stop words
# smith_bigrams %>%
# count(bigram, sort = TRUE)
# excludes stop words
bigrams_separated <- smith_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# bigram counts
bigram_counts <- bigrams_filtered %>%
group_by(book) %>%
count(word1, word2, sort = TRUE)
head(bigram_counts)
```
Now we rejoin the filtered word pairs into bigrams (separated by a single blank space) using the `unite()` function.
```{r}
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
head(bigrams_united)
```
Let's busy ourselves with the frequency of terms and the inverse document frequency of terms (TF-IDF). The term frequency is simply the number of times a term, here a bigram, occurs relative to all terms, here bigrams. We weigh this relative frequency by an information importance measure called the inverse document frequency, which we define in natural logarithms as
$$
\mathrm{IDF}(\mathrm{term}) = \ln\left[\frac{n_{\mathrm{documents}}}{n_{\mathrm{documents\ containing\ term}}}\right]
$$
This log ratio is related to, but not the same as, the inverse of the odds of drawing a document that contains the term rather than any document.
We multiply the TF by the IDF to weigh the TF away from very common terms and toward terms that otherwise are infrequent.
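As a worked example on made-up counts (the tiny two-document table below is illustrative, not from the corpus), `bind_tf_idf()` reproduces the formula above: a term appearing in only one of two documents gets $IDF = \ln(2/1) \approx 0.69$, while a term appearing in both gets $IDF = \ln(2/2) = 0$ and therefore a TF-IDF of zero.
```{r}
# Minimal sketch: bind_tf_idf() on a made-up two-document word-count table
toy_counts <- tribble(
  ~doc, ~word,      ~n,
  "A",  "price",     3,  # only in A: idf = ln(2/1)
  "A",  "virtue",    1,  # in both:   idf = ln(2/2) = 0
  "B",  "virtue",    2,
  "B",  "conduct",   4   # only in B: idf = ln(2/1)
)
toy_counts %>%
  bind_tf_idf(word, doc, n)
```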
```{r}
book_words <- smith_books %>%
unnest_tokens(word, text) %>%
count(book, word, sort = TRUE)
book_tf_idf <- book_words %>%
bind_tf_idf(word, book, n)
book_tf_idf %>%
group_by(book) %>%
slice_max(tf_idf, n = 15) %>%
ungroup() %>%
ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free") +
labs(x = "tf-idf", y = NULL)
```
```{r}
bigram_tf_idf <- bigrams_united %>%
count(book, bigram) %>%
bind_tf_idf(bigram, book, n) %>%
arrange(desc(tf_idf))
head(bigram_tf_idf)
```
Of course, we want to see a picture of our handiwork (really Silge et al.!).
```{r}
library(forcats)
plt <- bigram_tf_idf %>%
group_by(book) %>%
slice_max(tf_idf, n = 15) %>%
ungroup() %>%
ggplot(aes(tf_idf, fct_reorder(bigram, tf_idf), fill = book)) +
geom_col( show.legend = FALSE ) +
facet_wrap( ~ book, ncol = 2, scales = "free" ) +
labs( x = "tf-idf", y = NULL )
plt
```
### Sentiment Analysis
A typical sentiment analysis matches the words in a corpus against a lexicon of conceptually and emotionally connotative words. The analysis will only be as good as the lexicon, since the lexicon supplies the higher, interpretive viewpoint that we bring to the corpora as observed data.
The `tidytext` package provides access to several sentiment lexicons (`bing`, `afinn`, `loughran`, `nrc`); here we use the `bing` lexicon, which classifies words simply as positive or negative.
```{r}
glimpse(get_sentiments("bing"))
```
Wherever the corpus contains one of these words, a match is made against the lexicon; the matches can then be scored and summed by chapter, as in the sketch and the chunk below.
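Before scoring the full corpus, a minimal sketch on a handful of made-up tokens shows the mechanics (which of these particular words actually match depends on the lexicon): `inner_join()` keeps only the tokens the `bing` lexicon knows, and each match is then scored 1 if positive and 0 if negative, the same convention used in the chunk that follows.
```{r, message=FALSE}
# Minimal sketch: matching made-up tokens against the bing lexicon and scoring them
toy_tokens <- tibble(word = c("prosperity", "improvement", "corruption", "ploughshare"))
toy_tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  mutate(score = ifelse(sentiment == "negative", 0, 1))
```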
```{r message=FALSE, warning=FALSE}
d <- smith_books_tidy %>%
  inner_join(get_sentiments("bing")) %>%
mutate( score = ifelse( sentiment == "negative", 0, 1) ) %>%
group_by(book, chapter) %>%
summarise(sentiment = sum(score))
plt <- d %>%
ggplot(aes(x = chapter, y = sentiment, color = book)) +
geom_point(alpha = 1/10) +
geom_smooth(se = FALSE) +
facet_wrap(~ book, scales = "free_x") +
hrbrthemes::theme_ipsum_tw() +
ggthemes::scale_color_gdocs(guide = FALSE)
plt
```
The _Theory of Moral Sentiments_ begins on a negative note, becomes more positive, and ends on a higher note than it began, with perhaps a cautionary tale at the very end. In contrast, the _Wealth of Nations_ shows an almost uninterrupted climb from low counts of positively scored words to much higher counts in the final chapters. We cannot emphasize enough that what counts as positive or negative sentiment is culture-bound and time (if not epoch) sensitive.
### TF-IDF
When we retrieve information from a corpus, term frequency (TF)–inverse document frequency (IDF) weighting attempts to measure how important a term is to one document within a collection or corpus. The TF-IDF value increases with the number of times a term appears in a document and is offset by the number of documents in the corpus that contain the term, which adjusts for the fact that some terms are simply frequent everywhere. TF-IDF is one of the most popular term-weighting schemes in information retrieval, text mining, and user modeling; a 2015 survey found that 83% of text-based recommender systems in digital libraries use it.
Applied here, the TF-IDF adjustment weights the term frequency within each document (in this case, each book) by the specificity of the term to that document. Terms used in both books are weighted down, and terms used almost exclusively in one book are weighted up; the idea is to find the most distinctive terms for each book. My intuition tells me this won't be too different from the raw counts, since the two books use very different vocabularies. Just for the sake of it, let's use bigrams:
```{r}
smith_bigrams <- smith_books %>%
unnest_tokens( bigram, text, token = "ngrams", n = 2 )
bigrams_separated <- smith_bigrams %>%
separate( bigram, c( "word1", "word2" ), sep = " " )
bigrams_filtered <- bigrams_separated %>%
filter( !word1 %in% stop_words$word ) %>%
filter( !word2 %in% stop_words$word ) %>%
filter( !str_detect(word1, "\\d" ),
!str_detect(word2, "\\d" ) )
bigrams_united <- bigrams_filtered %>%
unite( bigram, word1, word2, sep = " " )
bigrams_united %>%
count( book, bigram ) %>%
group_by( book ) %>%
arrange( desc(n) ) %>%
top_n( 10 ) %>%
ggplot( aes( x = fct_reorder(bigram, n), y = n, fill = book ) ) +
geom_col() +
coord_flip() +
facet_wrap( ~ book, scales = "free" ) +
hrbrthemes::theme_ipsum() +
ggthemes::scale_fill_gdocs( guide = FALSE ) +
labs( x = "" )
```
With just the counts, the difference in subject is even more striking: for example, _human nature_ compared with _foreign trade_. We might even take the top bigram of the _Theory_ as the main subject of that book and not be completely wrong.
```{r}
bigrams_united %>%
count( book, bigram ) %>%
bind_tf_idf( bigram, book, n ) %>%
group_by( book ) %>%
arrange( desc(tf_idf) ) %>%
top_n( 10 ) %>%
ggplot( aes(x = fct_reorder(bigram, tf_idf), y = tf_idf, fill = book ) ) +
geom_col() +
coord_flip() +
facet_wrap( ~ book, scales = "free" ) +
hrbrthemes::theme_ipsum() +
ggthemes::scale_fill_gdocs( guide = FALSE ) +
labs(x = "")
```
The TF-IDF adjustment did not change much.
### A network of words
We can depict the words that occur in common bigrams as nodes, joined by edges for the pairings that occur frequently enough.
```{r}
library(igraph)
# filter for only relatively common combinations
bigram_graph <- bigram_counts %>%
filter(n > 30) %>%
graph_from_data_frame()
#bigram_graph
library(ggraph)
set.seed(42)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
```
Bigrams are interesting artifacts because they represent specific, adjacency-constrained word pairings. We can also ask a looser question: which pairs of words tend to appear together within a defined distance, such as a chapter or section, even when they are not adjacent? Investigating that question with a correlation design is the beginning of building a semantic structure.
For binary co-occurrence, the correlation is the $\phi$ coefficient. Suppose we tally the four combinations of finding two words $X$ and $Y$ within a unit of text:
- *not X and not Y*, with $n_{00}$ of these,
- *not X and Y*, with $n_{01}$ of these,
- *X and not Y*, with $n_{10}$ of these, and finally,
- *X and Y*, with $n_{11}$ present in the data.
Then
$$
\phi = \frac{n_{11}n_{00}-n_{10}n_{01}}{\sqrt{(n_{00}+n_{10})(n_{01}+n_{11})(n_{00}+n_{01})(n_{10}+n_{11})}}
$$
The following frequency table, with marginal sums on its borders, summarizes the calculation.
$$
\begin{array}{c|cc|c}
 & not\,Y & Y & \\
\hline
not\,X & n_{00} & n_{01} & n_{00} + n_{01} \\
X & n_{10} & n_{11} & n_{10} + n_{11} \\
\hline
 & n_{00}+n_{10} & n_{01}+n_{11} & n
\end{array}
$$
These are all in row-column configuration, where, for example, $n_{01}$ is the number of units in the corpus in which $not\,X$ and $Y$ occur together, and $n$ is the total count across all four cells. It would be correct to interpret the numerator of $\phi$ as a binary covariance of $X$ with $Y$ and the denominator as the product of the standard deviations of the $X$ and $Y$ occurrences.
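A quick numerical check with made-up counts: the $\phi$ formula gives the same value as the Pearson correlation of the two underlying 0/1 indicator vectors, which is exactly what is computed chapter by chapter in the chunk below.
```{r}
# Made-up 2x2 counts: phi from the formula versus cor() on 0/1 indicators
n11 <- 10; n10 <- 20; n01 <- 30; n00 <- 40
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n00 + n10) * (n01 + n11) * (n00 + n01) * (n10 + n11))
x <- rep(c(0, 0, 1, 1), times = c(n00, n01, n10, n11))  # is X present?
y <- rep(c(0, 1, 0, 1), times = c(n00, n01, n10, n11))  # is Y present?
c(phi = phi, pearson = cor(x, y))
```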
In the code below, the `widyr` package's `pairwise_cor()` function calculates $\phi$ for pairs of words across chapters within each book.
```{r}
# work in progress
#
# another function whose flow we might want to repeat
# first a bespoke function to build a group-wise pairing correlation
pairwise_book <- function(df) {
df %>%
group_by(word) %>%
filter(n() >= 100) %>%
pairwise_cor(word, chapter, sort = TRUE)
}
# back to the tidy version of the books
smith_correlations <- smith_books_tidy %>%
filter(!str_detect(word, "\\d")) %>%
group_by(book) %>%
do(pairwise_book(.)) %>%
group_by(book) %>%
arrange(desc(correlation)) %>%
top_n(20, correlation) %>% ungroup
smith_correlations %>%
ggraph( layout = "fr" ) +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 3) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_graph()
```
```{r, eval = FALSE}
# a reusable plotter for a word-correlation network
corr_graph <- function(df) {
  ggraph(df, layout = "fr") +
    geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
    geom_node_point(color = "lightblue", size = 3) +
    geom_node_text(aes(label = name), repel = TRUE) +
    theme_graph()
}
# split the correlations by book, build one graph per book, plot them side by side
smith_data <- smith_correlations %>%
  filter(correlation > 0.1) %>%
  group_by(book) %>%
  nest()
smith_data$data %>%
  map(graph_from_data_frame) %>%
  map(corr_graph) %>%
  reduce(cowplot::plot_grid, labels = as.character(smith_data$book))
```
We have two very distinct clusters linked by _altogether_. In other parameterizations and filters the linking word is _human_. Words in _Moral_ are abstract and conceptual, a turn of philosophical terminology about the human condition and the decisions humans make in that condition. Notions of value, judgment, nature, and even probability enter the network.
The words in the network of ideas in _Wealth_ denote specific times, places, amounts, nations, products, price as value, markets as the nexus of human exchange, bellicose events, and political forces. Words in _Wealth_ are practical, measurable, observable immediately, and related through commerce and polity.
## Some provisional hypotheses
In a sense, without knowing anything except the assumed common authorship of the two works, one hypothesis is that _Wealth_ is the practical science of an _applied Morality_.
A second question, given that _Moral_ was published before _Wealth_, is whether the contentions of _Moral_ carry forward into _Wealth_.