---
title: "United Nations General Debates: Uncovering International Political Topics through Machine Learning"
author: "P. Prado"
date: "07/01/2020"
output:
  pdf_document:
    latex_engine: xelatex
    number_sections: TRUE
mainfont: Helvetica Neue
sansfont: Helvetica Neue Light
abstract: "This paper analyses the United Nations (UN) General Debates dataset provided by Harvard's Dataverse with the objective of uncovering the main topics discussed from 1970 to 2018. The dataset includes the yearly speeches delivered by world leaders, from which the main topics can be revealed by (i) data preprocessing and cleaning, (ii) application of Machine Learning, mainly the LDA algorithm, and (iii) data analysis. The algorithm was set to identify 20 topics, of which 12 were chosen for further study: (i) Peace in Africa, (ii) War & Terrorism, (iii) Korea, (iv) Israel & Palestine, (v) Peace in Iraq, (vi) Security Council, (vii) Human Rights, (viii) European Conflicts, (ix) Nuclear Weapons, (x) South Africa & Namibia, (xi) Climate Change, and (xii) Economic Development. The time series analysis of those topics revealed trends aligned with known historical events such as the African conflicts (Independence of Namibia in 1990), the Yugoslav wars, the Global Financial Crisis and Climate Change, with the latter being the most prevalent topic in the last decade. Although satisfactory results are achieved, not only in determining the topics but also in revealing their trends over time and the relationships between topics and continents, further improvements are still possible. These could include tuning the number of topics given as input to the LDA algorithm, employing alternative unsupervised Machine Learning algorithms such as PCA and K-means, and even a combination with supervised learning techniques such as Random Forests. The current analysis can also be expanded with sentiment analysis to understand the views of regions, or countries, on the given topics."
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, cache = TRUE, background = '#F7F7F7', fig.align = "center", out.width = "60%")
```
```{r load or download required libraries, cache=FALSE}
## Install any missing package, then attach it
## (install.packages() on its own does not load the package)
pkgs <- c(
  "knitr",
  ## Tidy data
  "tidyverse", "tidytext", "textdata", "textstem", "readtext",
  "countrycode", "grid", "kableExtra", "png",
  ## Machine learning
  "quanteda", "topicmodels",
  ## Visualisation
  "circlize", "ggplot2", "gridExtra"
)
for (pkg in pkgs) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg, repos = "http://cran.us.r-project.org")
    library(pkg, character.only = TRUE)
  }
}
```
```{r generate-dataset in un_data}
dl <- tempfile() ## download the zip file to a temporary location
download.file(
"https://github.com/pzprado/un-general-debates/raw/master/UNGDC+1970-2018.zip",
dl
)
unzip(dl, exdir = paste0(getwd(), '/UN_data'), overwrite = TRUE) ## unzip the file
undir <- paste0(getwd(), '/UN_data') ## point to the directory of the files
un_files <- ## generate the dataset
readtext(
paste0(undir, "/Converted sessions/*/*.txt"),
docvarsfrom = "filenames",
docvarnames = c("country", "session", "year"),
dvsep = "_",
encoding = "UTF-8"
)
un_data <- ## consolidate the master dataset with country names and continent
un_files %>%
mutate(
country_name = countrycode(
sourcevar = un_files$country,
origin = 'iso3c',
destination = 'iso.name.en'
),
continent = countrycode(
sourcevar = un_files$country,
origin = 'iso3c',
destination = 'continent'
)
) %>%
select(doc_id, text, country, country_name, continent, session, year)
## fixing NA due to no match for YDYE, CSK, YUG, DDR and EU in the countrycode package
un_data[un_data$country == 'YDYE', "country_name"] <- "Yemen, Democratic"
un_data[un_data$country == 'YDYE', "continent"] <- "Asia"
un_data[un_data$country == 'CSK', "country_name"] <- "Czechoslovakia"
un_data[un_data$country == 'CSK', "continent"] <- "Europe"
un_data[un_data$country == 'YUG', "country_name"] <- "Yugoslavia"
un_data[un_data$country == 'YUG', "continent"] <- "Europe"
un_data[un_data$country == 'DDR', "country_name"] <- "East Germany"
un_data[un_data$country == 'DDR', "continent"] <- "Europe"
un_data[un_data$country == 'EU', "country_name"] <- "European Union"
un_data[un_data$country == 'EU', "continent"] <- "Europe"
un_data$doc_id <- sub("\\.txt$", "", un_data$doc_id) ## strip the literal ".txt" extension from doc_id
#rm(dl, undir, un_files) ## keep the environment clean
```
# Introduction
The United Nations (UN) General Debates are held every year as part of the annual General Assembly meeting. On this occasion, world leaders gather to discuss and share their views on topics that affect the world and the countries and entities they represent. The opening statement from each leader is made available in the United Nations General Debates dataset.^[Jankin Mikhaylov, Slava; Baturo, Alexander; Dasandi, Niheer, 2017, "United Nations General Debate Corpus", https://doi.org/10.7910/DVN/0TJX8Y, Harvard Dataverse, V5] The dataset contains the documented speeches for the period from 1970 to 2018, which makes it valuable for understanding how countries' concerns, or the international political agenda, varied over time.
The goal of this study is to apply Machine Learning techniques to uncover the topics discussed in the UN General Debates documents, a task that would otherwise require extensive human resources to read and categorise them. Topic modelling, an unsupervised Machine Learning technique, together with robust data analysis makes it possible to answer questions such as: (i) which topics dominated the debates over the 49-year period and (ii) how topics relate to continents.
# Method and Analysis
Natural Language Processing (NLP) involves the challenge of analysing unstructured data. As such, the key steps in this study are data preprocessing and cleaning, the transformation of the texts into word tokens, the application of Machine Learning algorithms and the visualisation of the resulting data. While seemingly simple, extensive work is required, particularly in data cleaning and preprocessing and in fitting the Latent Dirichlet Allocation (LDA) algorithm to identify topics; these tasks may take a few hours of computation time.
## _Exploration_
The UN General Debates dataset contains 8,093 opening statements from the world leaders that attend the annual meeting. A look at the dataset shows the following information:
```{r un_data summary}
class(un_data)
glimpse(un_data)
```
A full summary is provided with the code below, demonstrating that the dataset comprises statements from 1970 to 2018.
```{r un-data-stats, echo = TRUE}
un_data.stats <- summary(un_data)
un_data.stats
```
Another summary focuses on the number of unique documents, countries and continents. The 8,093 opening statements were delivered on behalf of 200 countries and entities across 5 continents over the 49-year period of the dataset.
```{r unique-numbers}
un_data %>%
summarise(
documents = n_distinct(doc_id),
years = n_distinct(year),
countries = n_distinct(country),
continents = n_distinct(continent)
)
```
It is worth noting that the General Debates in 2018 included 196 countries/documents (193 UN member states, the European Union and the observer states of the Holy See and the State of Palestine), as seen in the extract below. This number is lower than the 200 countries in the dataset because political changes led to the merger or separation of countries (e.g. Yugoslavia) over time.
```{r 2018-unique-numbers}
un_data %>%
filter(year == '2018') %>%
group_by(year) %>%
summarise(
documents = n_distinct(doc_id),
countries = n_distinct(country),
continents = n_distinct(continent)
)
```
## _Data preprocessing and analysis_
The `un_data` dataset is a data frame, which is not the most suitable object for analysing text data. The corpus class is widely used in Natural Language Processing (NLP), so converting the dataset is the first step. The preprocessing tasks in this study are:
1. Corpus conversion;
2. Lemmatisation;
3. Tokenization; and
4. Cleaning.
The steps are briefly described below:
### Step 1: Corpus conversion
The data frame is converted into a corpus with a single call to `corpus()` from the `quanteda` package, as demonstrated below.
```{r step-1-convert-corpus, echo = TRUE}
un_corpus <- corpus(un_data, text_field = "text")
class(un_corpus)
```
Now with a corpus object, the summary provides a lot more information, already containing data such as the number of tokens (words) and sentences per document.
```{r corpus-summary}
summary(un_corpus, n = 5)
```
From the number of tokens and sentences it is possible to see how the length of the speeches changed over time. The plots below allow the length of the speeches to be compared with the number of speeches delivered by world leaders in each year. As the number of member states increased, the UN has likely implemented measures to limit the time that each country had to deliver its speech.
```{r speeches-plots, fig.show = "hold", out.width = "40%", fig.height = 4}
un_corpus.stats <- as.data.frame(summary(un_corpus, n = 8093))
tokens_plot <- un_corpus.stats %>%
group_by(year, continent) %>%
summarize(tokens = mean(Tokens)) %>%
ggplot(aes(year, tokens, color = continent)) +
geom_smooth(method = 'loess', fill = 'NA') +
theme(legend.position = "top") +
ggtitle('Average speech tokens per continent') +
labs(color = NULL)
sentence_plot <- un_corpus.stats %>% group_by(year, continent) %>%
summarize(sentences = mean(Sentences)) %>%
ggplot(aes(year, sentences, color = continent)) +
geom_smooth(method = 'loess', fill = 'NA') +
theme(legend.position = "none") +
ggtitle('Average speech sentences per continent')
speeches_plot <- un_corpus.stats %>%
group_by(year) %>%
summarize(countries = n_distinct(country)) %>%
ggplot(aes(year, countries)) +
geom_smooth(method = 'loess', fill = 'NA') +
ggtitle('Speeches per year')
#grid.arrange(tokens_plot, sentence_plot, speeches_plot, nrow = 3)
tokens_plot
sentence_plot
speeches_plot
```
### Step 2: Lemmatisation
The analysis of text data requires grouping words found in texts. However, grouping exact matches within a text would not yield good results in topic modelling due to the inflection of words (e.g. consulting, consultant, consultation). There are many approaches to solving this problem but the most popular ones are Stemming and Lemmatisation.
In short, Stemming stands for "cutting" part of the word to reach its root. In this case, "consulting" and "consultant" would be reduced to "consult". On the other hand, Lemmatisation looks at the morphological meaning of the word, as defined in the Cambridge Dictionary:
>_the process of reducing the different forms of a word to one single form, for example, reducing "builds", "building", or "built" to the lemma "build":_
>* _Lemmatization is the process of grouping inflected forms together as a single base form._
>* _In dictionaries, there are fixed lemmatization strategies._
Each approach has advantages and disadvantages. Stemming is a faster process but the results may not be as reliable; for instance, an aggressive stemmer may reduce the unrelated words "popular" and "population" to the same stem. Lemmatisation, on the other hand, better preserves the meaning of the words, at the cost of a much longer processing time. In this study, the latter has been used.
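The difference between the two approaches can be illustrated with a minimal base R sketch. The suffix rule and the small lemma dictionary (`toy_stem`, `lemma_dict`, `toy_lemma`) are hypothetical and only serve the illustration; the study itself relies on `lemmatize_strings()` with the `lexicon::hash_lemmas` dictionary.

```r
# Toy illustration only: real stemmers and lemmatisers are far more elaborate.
words <- c("builds", "building", "built", "consulting")

# Stemming: strip a couple of common suffixes with a regular expression.
toy_stem <- function(x) sub("(ing|s)$", "", x)

# Lemmatisation: look each word up in a (hypothetical) lemma dictionary.
lemma_dict <- c(builds = "build", building = "build", built = "build",
                consulting = "consult")
toy_lemma <- function(x) {
  hit <- lemma_dict[x]                  # NA when the word is not in the dictionary
  unname(ifelse(is.na(hit), x, hit))    # fall back to the original word
}

stems  <- toy_stem(words)
lemmas <- toy_lemma(words)
data.frame(word = words, stem = stems, lemma = lemmas)
```

In this sketch the suffix rule leaves the irregular form "built" untouched, whereas the dictionary lookup maps it to "build".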
```{r step-2-lemmatisation, echo = TRUE}
un_corpus_lemma <- # create a new corpus to preserve the original corpus
corpus(un_data, text_field = "text")
start <- Sys.time()
un_corpus_lemma$documents$texts <-
lemmatize_strings(un_corpus_lemma$documents$texts, dictionary = lexicon::hash_lemmas)
Sys.time()-start
```
The processing time is indicated above as the "Time difference" between start and end of the Lemmatisation.
### Step 3: Tokenization
Up to now, the text data is stored in documents as strings, that is, the text for each document is a single string. Tokenization - in NLP - is the process of splitting the strings into separate words (or tokens). This process will still keep the tokens allocated to each document in the corpus.
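As a minimal sketch of the idea, assuming a naive whitespace split rather than the `quanteda` tokenizer used in this study, two made-up documents can be tokenized in base R as follows. The names `docs` and `toks` are hypothetical.

```r
# Two made-up documents, each stored as a single string
docs <- c(doc1 = "Peace and security for all nations.",
          doc2 = "Climate change affects all nations.")

# Split each string on whitespace; the result is a list with one
# character vector of tokens per document, so tokens stay tied to their document
toks <- strsplit(docs, "\\s+")

lengths(toks)   # number of tokens per document
```

Punctuation is still attached to some tokens (e.g. "nations."), which is one reason a dedicated tokenizer and a cleaning step are needed.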
```{r step-3-tokenization, echo = TRUE}
un_tokens <-
quanteda::tokens(
un_corpus_lemma
)
```
A high level look into the results below already indicates why cleaning is important in this study. The tables below show the top 10 and bottom 10 terms.
```{r dirty-tokens}
top_terms.d <-
un_tokens %>% dfm() %>% tidy() %>% group_by(term) %>%
summarize(count = n()) %>% arrange(desc(count)) %>% slice(1:10)
bot_terms.d <-
un_tokens %>% dfm() %>% tidy() %>% group_by(term) %>%
summarize(count = n()) %>% arrange(count) %>% slice(1:10)
top_terms.d
bot_terms.d
```
### Step 4: Cleaning
In this step, the objective is to transform all tokens into lowercase and eliminate tokens that are symbols, URLs, punctuation, stopwords, numbers and hyphens. The `quanteda` package is used for the lowercase transformation and the removal of stopwords; both are applied while converting the tokens into a document-feature matrix (DFM). The result shows an improvement, as noted below.
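The effect of these cleaning rules can be sketched on a handful of made-up tokens with base R. The token vector and the two-word stop list are hypothetical; the study itself uses the `quanteda` options and `stopwords("english")`.

```r
toks <- c("The", "UN", "adopted", "17", "goals", ",", "in", "2015", ".")
stop_words <- c("the", "in")   # hypothetical, deliberately tiny stop-word list

clean <- tolower(toks)                             # lowercase everything
clean <- clean[!grepl("^[[:punct:]]+$", clean)]    # drop punctuation-only tokens
clean <- clean[!grepl("^[0-9]+$", clean)]          # drop number-only tokens
clean <- clean[!clean %in% stop_words]             # drop stop words
clean
```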
```{r step-4-cleaning}
un_words <- c("unite", "nation", "country", "international",
"united", "nations", "countries", "will", "world")
un_tokens <-
quanteda::tokens(
un_corpus_lemma,
remove_numbers = TRUE,
remove_twitter = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_url = TRUE,
remove_hyphens = TRUE,
include_docvars = TRUE
)
un_dfm <-
dfm(un_tokens,
tolower = TRUE,
stem = FALSE,
remove = c(stopwords("english"),
un_words
)
)
top_terms.cl <-
un_dfm %>% tidy() %>% group_by(term) %>%
summarize(count = n()) %>% arrange(desc(count)) %>% slice(1:10)
bot_terms.cl <-
un_dfm %>% tidy() %>% group_by(term) %>%
summarize(count = n()) %>% arrange(count) %>% slice(1:10)
top_terms.cl
bot_terms.cl
```
Cleaning can still be improved by (i) trimming words that occur rarely and (ii) removing words that do not add value to topic modelling. In the first case (i), a few words contain digits instead of letters, such as "0ctober" and "0n", due to the processing of the texts stored as image files prior to 1992 (Baturo et al. 2017). In the second case (ii), the words "nation", "unite", "international", "country" and "world" are amongst the most frequent ones as they directly relate to the United Nations, and therefore do not add value to topic modelling.
Trimming removes terms with very low occurrence. Removing the terms that appear in fewer than 0.2% of the documents (`min_docfreq = 0.002` as a proportion) yields the following result:
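Trimming by document frequency can be sketched in base R on a toy count matrix. The matrix `m` and the threshold `min_share` are hypothetical; with the 8,093 documents of this study, a proportion of 0.002 corresponds to roughly 16 documents.

```r
# Toy document-term count matrix: 5 documents x 3 terms (filled column by column)
m <- matrix(c(2, 1, 0, 3, 1,    # "peace"
              0, 0, 1, 0, 0,    # "0ctober" (OCR error, one document only)
              1, 0, 2, 0, 1),   # "climate"
            nrow = 5,
            dimnames = list(paste0("doc", 1:5),
                            c("peace", "0ctober", "climate")))

doc_freq  <- colSums(m > 0) / nrow(m)   # share of documents containing each term
min_share <- 0.4                        # hypothetical threshold for this toy matrix
trimmed   <- m[, doc_freq >= min_share, drop = FALSE]
colnames(trimmed)
```

The rare, misread term is dropped while the terms shared by several documents are kept.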
```{r dfm-trimming}
un_dfm <- dfm_trim(un_dfm,
min_docfreq = 0.002,
docfreq_type = "prop"
)
top_terms.tr <-
un_dfm %>% tidy() %>% group_by(term) %>%
summarize(count = n()) %>% arrange(desc(count)) %>% slice(1:10)
bot_terms.tr <-
un_dfm %>% tidy() %>% group_by(term) %>%
summarize(count = n()) %>% arrange(count) %>% slice(1:10)
top_terms.tr
bot_terms.tr
```
A word cloud helps visualise the relative frequency of the top words. The size of each word represents its frequency.
```{r total-wordcloud}
un_dfm %>% textplot_wordcloud()
```
## _LDA_
There are various Machine Learning algorithms that can be applied in NLP. This study utilises the Latent Dirichlet Allocation (LDA) algorithm for topic modelling introduced by Blei et al. (2003). It was chosen due to its frequent use in unsupervised learning within NLP. The referenced publication explains LDA in technical detail, so this study only briefly explains how the model is fitted.
A known limitation of the LDA algorithm is that the number of topics (or clusters) needs to be specified upfront. In the case of the UN General Debates this can mean losing relevant insights. For now, the target will be to identify 20 topics.^[20 was arbitrarily determined.]
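For reference, the generative model that LDA assumes, as described by Blei et al. (2003), can be summarised as follows. The notation ($K$ topics, document $d$, word position $n$, Dirichlet parameter $\alpha$, topic-word distributions $\beta_k$) is introduced here for illustration and is not used elsewhere in this report.

$$
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic proportions of document } d \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d) && \text{topic assigned to the } n\text{-th word} \\
w_{d,n} \mid z_{d,n} &\sim \mathrm{Multinomial}(\beta_{z_{d,n}}) && \text{word drawn from that topic's word distribution}
\end{aligned}
$$

Fitting the model means estimating $\theta_d$ for every document and $\beta_k$ for each of the $K$ topics from the observed words $w_{d,n}$. Since $\theta_d$ has one entry per topic, $K$ has to be fixed before fitting, which is the limitation noted above.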
To fit the model, the parameters below are specified into the code that follows:
* _k_ for the number of topics
* _seed_ for the replication of the results
* _method_ for the sampling method^[This LDA application utilised the method Gibbs for sampling (Resnik and Hardisty 2010).]
```{r LDA-fit, echo = TRUE}
un_dtm <- convert(un_dfm, to = "topicmodels")
k <- 20 #number of topics
seed <- 123 #necessary for reproducibility
start <- Sys.time() #calculating the total runtime
lda <- LDA(un_dtm, k = k, method = "Gibbs", control = list(seed = seed))
Sys.time()-start # total runtime between end and start of LDA processing
```
The "time difference" shown above indicates, again, how long it took for the LDA function to be processed.
Many other parameters can be adjusted within the LDA function, including the number of samples to discard and the number of iterations. They were left at their defaults in this study in order to assess the performance of a standard application of the algorithm to the UN General Debates dataset.
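As a hedged sketch of what such an adjustment might look like, the control list below sets the burn-in, iteration and thinning options of the Gibbs sampler in `topicmodels`. The values are hypothetical, were not used in this study and have not been tuned for this dataset.

```r
# Hypothetical values for illustration; they were NOT used in this study.
gibbs_control <- list(
  seed   = 123,   # reproducibility, as in the fit above
  burnin = 500,   # initial Gibbs iterations to discard
  iter   = 2000,  # Gibbs iterations performed after the burn-in
  thin   = 100    # iterations skipped between retained samples
)

# The list would then be passed to the fit, for example:
# lda <- LDA(un_dtm, k = k, method = "Gibbs", control = gibbs_control)
str(gibbs_control)
```

Longer burn-in and more iterations increase the runtime reported above, so any change would need to be weighed against the computation time.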
# Results
## Topics
The following tables show the 20 topics found within the 8,093 documents, each with the 15 most relevant words that form the topic.
```{r LDA-output}
as.data.frame(terms(lda, 15)) %>%
select(1:10) %>%
kable(format = "latex") %>%
kable_styling(latex_options = c("striped", "scale_down"))
as.data.frame(terms(lda, 15)) %>%
select(11:20) %>%
kable(format = "latex") %>%
kable_styling(latex_options = c("striped", "scale_down"))
```
As seen above, a few topics do not convey much meaning from their first 15 words. This is likely because many documents refer to the General Assembly and to the general purposes of the UN as an organisation. Therefore, only the following topics are taken forward in the study:
* Topic 1: Peace in Africa
* Topic 2: War & Terrorism
* Topic 3: Korea
* Topic 4: Israel & Palestine
* Topic 6: Peace in Iraq
* Topic 8: Security Council
* Topic 10: Human Rights
* Topic 12: European Conflicts
* Topic 13: Nuclear Weapons
* Topic 14: South Africa & Namibia
* Topic 16: Climate Change
* Topic 17: Economic Development
The reduction from 20 to 12 topics facilitates the visualisation of topic trends and topic relationships.
```{r study-topics}
study_topics <- c(1, 2, 3, 4, 6, 8, 10, 12, 13, 14, 16, 17)
topic_names <- c('Peace in Africa',
'War & Terrorism',
'Korea',
'Israel & Palestine',
'Peace in Iraq',
'Security Council',
'Human Rights',
'European Conflicts',
'Nuclear Weapons',
'South Africa & Namibia',
'Climate Change',
'Economic Development')
df_topics <- data.frame(name = topic_names, topic = study_topics)
```
## _Topic trends_
With the topics already generated, it is possible to start analysing relationships between topics, year, continents and even countries.^[Country analysis is excluded from this study.]
This is possible because LDA not only generated the topics found in the documents (based on the combination of words) but also created the probabilities of each topic within each document.
This probability is referred to as gamma $\gamma$.
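The yearly averaging of $\gamma$ behind the trend plot can be sketched with base R on made-up data. Writing $\gamma_{d,k}$ for the probability of topic $k$ in document $d$ and $D_t$ for the set of documents delivered in year $t$ (notation introduced here for illustration), the plotted quantity is $\bar{\gamma}_{t,k} = \frac{1}{|D_t|}\sum_{d \in D_t} \gamma_{d,k}$. The data frame `gamma_df` and its values are hypothetical.

```r
# Toy per-document topic probabilities (each document's gammas sum to 1)
gamma_df <- data.frame(
  document = rep(c("A_1990", "B_1990", "A_1991"), each = 2),
  year     = rep(c(1990, 1990, 1991), each = 2),
  topic    = rep(1:2, times = 3),
  gamma    = c(0.8, 0.2, 0.4, 0.6, 0.1, 0.9)
)

# Average gamma per year and topic
yearly <- aggregate(gamma ~ year + topic, data = gamma_df, FUN = mean)
yearly
```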
```{r topic-year-plot}
#Extract gamma, join metadata, create average gamma
year_topic_relationship <- tidy(lda, matrix = "gamma") %>%
inner_join(un_data, by = c("document" = "doc_id")) %>%
select(year, topic, gamma) %>%
group_by(year, topic) %>%
summarize(gamma = mean(gamma))
year_topic_relationship %>%
filter(topic %in% study_topics) %>%
inner_join(df_topics, by = "topic") %>%
ggplot(aes(x = year, y = gamma, colour = factor(name))) +
geom_line(size = 1.5) +
scale_color_brewer(palette = "Paired") +
labs(colour = NULL)
```
From the plot above, the probability of the topics being debated at the UN can be related to historical events, such as:
* Topic 17: Economic Development peaked between 2005-2010, correlating to the Global Financial Crisis of 2007-2008
* Topic 12: European Conflicts peaked between 1990-1995, correlating to the Yugoslav wars between 1991-2001
* Topic 14: South Africa & Namibia peaked between 1975-1980 remaining high during 1980s, correlating to the South African Border War between 1966-1989 and Namibia's Independence in 1990
* Topic 6: Peace in Iraq peaked between 2010-2015, correlating to the Iraq War between 2003-2011
Those are just a few examples. As observed in the plot above, Topic 16: Climate Change has the most notable increase over recent years. Based on this data, Climate Change has been the prevailing topic in the UN General Debates since 2010.
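Peak years such as those listed above could also be located programmatically instead of being read off the plot. The following base R sketch uses made-up yearly averages; the data frame `yearly` and its numbers are hypothetical and do not reproduce the results of this study.

```r
# Toy yearly average gammas for two topics (made-up numbers)
yearly <- data.frame(
  year  = rep(2008:2010, times = 2),
  topic = rep(c(16, 17), each = 3),
  gamma = c(0.05, 0.08, 0.11, 0.09, 0.07, 0.06)
)

# For each topic, keep the row with the highest average gamma
peaks <- do.call(
  rbind,
  lapply(split(yearly, yearly$topic), function(d) d[which.max(d$gamma), ])
)
peaks
```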
Representing 12 topics visually is still a lot. Therefore, the continent-topic relationship is shown below only for the 5 topics discussed above.
```{r topic-continent-plot, dev="png", out.width="100%", dpi=600, fig.show="hide"}
topics.5 <- c(6, 12, 14, 16, 17)
#Extract gamma, join metadata, create average gamma
continent_topic_relationship <- tidy(lda, matrix = "gamma") %>%
filter(topic %in% topics.5) %>%
inner_join(un_data, by = c("document" = "doc_id")) %>%
select(continent, topic, gamma) %>%
group_by(continent, topic) %>%
summarize(gamma = mean(gamma))
grid.col = c("Americas" = "blue", "Asia" = "yellow", "Europe" = "red",
"Africa" = "green", "Oceania" = "brown", "6" = "grey", "12" = "grey",
"14" = "grey", "16" = "grey", "17" = "grey")
circos.clear()
circos.par(gap.after = c(rep(5, length(unique(continent_topic_relationship[[1]])) - 1), 15,
rep(5, length(unique(continent_topic_relationship[[2]])) - 1), 15))
# ChordDiagram (Continent topic relationship)
chordDiagram(continent_topic_relationship, grid.col = grid.col)
title("All Time Topic and Continent Relationship")
grid.text(" 6 Peace in Iraq
12 European Conflicts
14 South Africa & Namibia
16 Climate Change
17 Economic Development",
x=unit(0.05, "npc"), y=unit(0.8, "npc"), just="left",
gp=gpar(fontsize=8))
continent_topic_relationship <- tidy(lda, matrix = "gamma") %>%
inner_join(un_data, by = c("document" = "doc_id")) %>%
filter(topic %in% topics.5, year == "1990") %>%
select(continent, topic, gamma) %>%
group_by(continent, topic) %>%
summarize(gamma = mean(gamma))
circos.clear()
circos.par(gap.after = c(rep(5, length(unique(continent_topic_relationship[[1]])) - 1), 15,
rep(5, length(unique(continent_topic_relationship[[2]])) - 1), 15))
chordDiagram(continent_topic_relationship, grid.col = grid.col)
title("Topic and Continent Relationship in 1990")
continent_topic_relationship <- tidy(lda, matrix = "gamma") %>%
inner_join(un_data, by = c("document" = "doc_id")) %>%
filter(topic %in% topics.5, year == "2000") %>%
select(continent, topic, gamma) %>%
group_by(continent, topic) %>%
summarize(gamma = mean(gamma))
circos.clear()
circos.par(gap.after = c(rep(5, length(unique(continent_topic_relationship[[1]])) - 1), 15,
rep(5, length(unique(continent_topic_relationship[[2]])) - 1), 15))
chordDiagram(continent_topic_relationship, grid.col = grid.col)
title("Topic and Continent Relationship in 2000")
continent_topic_relationship <- tidy(lda, matrix = "gamma") %>%
inner_join(un_data, by = c("document" = "doc_id")) %>%
filter(topic %in% topics.5, year == "2018") %>%
select(continent, topic, gamma) %>%
group_by(continent, topic) %>%
summarize(gamma = mean(gamma))
circos.clear()
circos.par(gap.after = c(rep(5, length(unique(continent_topic_relationship[[1]])) - 1), 15,
rep(5, length(unique(continent_topic_relationship[[2]])) - 1), 15))
chordDiagram(continent_topic_relationship, grid.col = grid.col)
title("Topic and Continent Relationship in 2018")
```
```{r chord-diagrams, out.width="100%"}
fig1 <-
rasterGrob(as.raster(readPNG(
paste0(
getwd(),
'/un_general_debates_report_files/figure-latex/topic-continent-plot-1.png'
)
)), interpolate = FALSE)
fig2 <-
rasterGrob(as.raster(readPNG(
paste0(
getwd(),
'/un_general_debates_report_files/figure-latex/topic-continent-plot-2.png'
)
)), interpolate = FALSE)
fig3 <-
rasterGrob(as.raster(readPNG(
paste0(
getwd(),
'/un_general_debates_report_files/figure-latex/topic-continent-plot-3.png'
)
)), interpolate = FALSE)
fig4 <-
rasterGrob(as.raster(readPNG(
paste0(
getwd(),
'/un_general_debates_report_files/figure-latex/topic-continent-plot-4.png'
)
)), interpolate = FALSE)
# Organise the grid
grid.arrange(fig1, fig2, fig3, fig4, ncol = 2)
```
While the continents' relationship with the Climate Change topic is fairly homogeneous (particularly in 2018), other topics have a clearly stronger relationship with the continent where the topic originated. The relationships are:
* Economic Development and the Americas
* European Conflicts and Europe
* Peace in Iraq and Asia
* South Africa & Namibia and Africa.
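The continent with the strongest relationship to each topic can also be read from a continent-by-topic matrix of average $\gamma$ rather than from the chord diagrams. The base R sketch below uses made-up numbers; the data frame `rel` is hypothetical and does not reproduce the values of this study.

```r
# Toy average gammas by continent and topic (made-up numbers)
rel <- data.frame(
  continent = rep(c("Africa", "Europe"), each = 2),
  topic     = rep(c("South Africa & Namibia", "European Conflicts"), times = 2),
  gamma     = c(0.12, 0.02, 0.03, 0.09)
)

# Continent x topic matrix of mean gamma
mat <- tapply(rel$gamma, list(rel$continent, rel$topic), mean)

# For each topic (column), the continent (row) with the highest average gamma
strongest <- rownames(mat)[apply(mat, 2, which.max)]
names(strongest) <- colnames(mat)
strongest
```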
# Conclusion
The UN General Debates dataset is extremely valuable in providing insights about international politics. In this study, the application of Unsupervised Machine Learning via the LDA algorithm proved effective in uncovering the main topics and their trends. Some of the identified topics relate to historical events such as regional wars, economic development and other international issues. According to those results, the most prevalent topic of the past decade is Climate Change. It reached the highest probability recorded for the analysed period, over 20%, followed by the topics related to South Africa & Namibia and European Conflicts.
In terms of expanding this study, possible improvements include benchmarking other unsupervised algorithms such as PCA and K-means. This work can be further developed via the application of supervised learning techniques and the addition of sentiment analysis, which could produce valuable insights into the countries' perspectives on certain matters. Lastly, such an understanding of countries' perspectives would make it possible to conclude whether countries within the same continent discuss the same topics.
# References
Baturo A, Dasandi N, Mikhaylov S. 2017. Understanding State Preferences With Text As Data: Introducing the UN General Debate Corpus. Research & Politics.

Blei DM, Ng AY, Jordan MI. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research. 3 (4–5):993–1022.

Resnik P, Hardisty E. 2010. Gibbs sampling for the uninitiated. Technical Report UMIACS-TR-2010-04, University of Maryland. http://drum.lib.umd.edu//handle/1903/10058.