# Data {#sec-data}
```{r}
#| echo: false
#| message: false
#| results: asis
source("warning.R")
```
```{r}
#| echo: false
#| message: false
#| warning: false
library(conflicted)
conflicts_prefer(dplyr::filter)
```
## Data sets {#sec-data-sets}
This book is about tools for creating visualizations. But to visualize data you first need data. So let's start by taking a look at some of the data sets available to us without much hassle.
### base R {#sec-base-r}
Base R comes with a bunch of data sets ready to use. There are classics like **iris** and **mtcars**, but there are many more to choose from.
```{r}
#| echo: false
#| message: false
#| warning: false
#| results: asis
library(dplyr)
library(purrr)
library(rmarkdown)
data() %>%
pluck(3) %>%
as_tibble() %>%
filter(Package == "datasets") %>%
select(Item, Title) %>%
arrange(Item, .locale = "en") %>%
paged_table()
```
Since the **datasets** package comes from base R, the data is not always immediately ready to use with **ggplot2** [@wickham2024]. Luckily we have the **tidyverse** [@wickham2023b] packages that make it easy to make the necessary changes.
Here is an example using the **WorldPhones** ('The World's Telephones') data set. We can start by loading the data set with the `data()` function.
```{r}
data(WorldPhones)
```
Let's take a quick look at what the first couple of rows of the data set look like.
```{r}
head(WorldPhones)
```
WorldPhones is a matrix with 7 rows and 7 columns. Each row holds the figures for a year and each column the figures for a region. We would like to turn it into a *tidy* format. We can use the **tibble** [@müller2023] package for the first part, moving the row names into a proper column. Then we'll use `pivot_longer()` from the **tidyr** package [@wickham2024b], which increases the number of rows and decreases the number of columns. We want the continents to be observations, not columns.
```{r}
library(dplyr)
library(tibble)
library(tidyr)
world_phones_tbl <- WorldPhones %>%
as.data.frame() %>%
rownames_to_column(var = "Year") %>%
as_tibble() %>%
pivot_longer(
cols = !Year,
names_to = "Continent",
values_to = "Phones"
) %>%
mutate(across(where(is.character), as.factor))
```
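As a cross-check on the reshape above, here is a minimal base R sketch that reaches the same long shape without the tidyverse. The object name `world_phones_long` is only illustrative, and the factor columns it produces may not be identical to the ones in the tibble version.

```r
# Load the matrix that ships with the datasets package
data(WorldPhones)

# One row per year, one column per region
dim(WorldPhones)

# as.table() followed by as.data.frame() turns every cell into its own row
world_phones_long <- as.data.frame(as.table(WorldPhones))
names(world_phones_long) <- c("Year", "Continent", "Phones")

# The long version should have rows x columns observations and three variables
nrow(world_phones_long) == nrow(WorldPhones) * ncol(WorldPhones)
head(world_phones_long)
```

If the row count of the long table does not equal the number of cells in the original matrix, something went wrong in the reshape.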
```{r}
#| echo: false
#| message: false
#| warning: false
#| results: asis
world_phones_tbl %>%
paged_table()
```
What we're left with is a tibble with three columns, **Year**, **Continent**, and **Phones**. We can then use the new tibble to create a simple graph with ggplot2.
```{r}
#| message: false
#| warning: false
#| results: asis
library(ggplot2)
world_phones_tbl %>%
ggplot(aes(Year, Phones, color = Continent, group = Continent)) +
geom_line() +
theme_bw()
```
### IMDb movies (1893-2005) {#sec-imdb-movies-1893-2005}
**ggplot2movies** [@wickham2015] used to be a part of the ggplot2 package itself. It’s now its own package to make ggplot2 lighter.
Still, it’s a cool little package. It has [Internet Movie Database (IMDb)](https://www.imdb.com/) data about movies released between 1893 and 2005. The selected movies have "a known length and had been rated by at least one \[IMDb\] user" [@wickham2015].
The **Movies** data set has qualities that make it good for our needs. Let’s start by loading it.
```{r}
#| message: false
#| warning: false
#| results: asis
library(ggplot2movies)
data(movies)
```
Let's take a quick look at what some of the data looks like.
```{r}
#| eval: false
head(movies)
```
```{r}
#| echo: false
#| message: false
#| results: asis
head(movies) %>%
paged_table()
```
```{r}
#| echo: false
#| message: false
#| warning: false
# Get the number of columns
.movies_ncols <- movies %>%
ncol()
# Get the row count
.movies_rowcount <- movies %>%
tally()
```
Movies is already a tibble. It consists of `r .movies_rowcount` rows (observations) and `r .movies_ncols` columns (variables).
When starting to work with a new data set, it's always good to take a look at the documentation to understand what is in those rows and columns (and what is not).
```{r}
#| echo: false
#| message: false
#| results: asis
library(tibble)
tribble(
~Variable, ~Description,
"title", "Title of the movie",
"year", "Year of release",
"budget", "Total budget (if known) in US dollars",
"length", "Length in minutes",
"rating", "Average IMDb user rating",
"votes", "Number of IMDb users who rated this movie",
"r1", "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 1",
"r2", "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 2",
"r3", "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 3",
"r4", "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 4",
"r5", "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 5",
"r6", "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 6",
"r7", "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 7",
"r8", "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie an 8",
"r9", "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 9",
"r10", "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 10",
"mpaa", "MPAA rating",
"Action", "Binary variable representing if movie was classified as belonging to that genre",
"Animation", "Binary variable representing if movie was classified as belonging to that genre",
"Comedy", "Binary variable representing if movie was classified as belonging to that genre",
"Drama", "Binary variable representing if movie was classified as belonging to that genre",
"Documentary", "Binary variable representing if movie was classified as belonging to that genre",
"Romance", "Binary variable representing if movie was classified as belonging to that genre",
"Short", "Binary variable representing if movie was classified as belonging to that genre"
) %>%
paged_table()
```
Movies is a good example data set because it includes:
- A goldilocks amount of data. Not too little, not too much
- *Categorical* data of both *nominal* (title, genre) and *ordinal* (mpaa) kind
- *Numerical* data of both *continuous* (budget, length, rating) and *discrete* (year, votes) kind
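In R terms, these kinds of data usually map onto different vector types. The sketch below uses made-up values rather than the real movies data, and the ordering of the MPAA levels is an assumption made purely for illustration.

```r
# Nominal: categories without a natural order
genre <- factor(c("Drama", "Comedy", "Drama"))

# Ordinal: categories with an order (this level order is an assumption)
mpaa <- factor(
  c("PG", "R", "PG-13"),
  levels = c("PG", "PG-13", "R", "NC-17"),
  ordered = TRUE
)

# Comparisons like this only make sense for ordered factors
below_r <- mpaa < "R"
below_r

# Discrete: whole numbers, stored here as integers
year <- c(1994L, 2001L, 1987L)

# Continuous: any value within a range, stored as doubles
rating <- c(7.5, 6.1, 8.3)

is.ordered(genre)
is.ordered(mpaa)
```

Knowing which kind a column is helps when choosing a chart type later on. An ordered factor, for instance, keeps its level order on an axis instead of being sorted alphabetically.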
We can use two of those columns, **year** and **rating**, to create a simple visualization with ggplot2.
```{r}
#| message: false
#| warning: false
#| results: asis
library(ggplot2)
movies %>%
ggplot(aes(year, rating)) +
geom_point(alpha = 0.05) +
theme_bw()
```
As mentioned earlier, Movies is already a tibble. That doesn't mean the data is in an optimal format for all kinds of visualization, though. We'll do all the necessary data wrangling within the chapter where we use the data.
### RDatasets {#sec-rdatasets}
[**RDatasets**](https://vincentarelbundock.github.io/Rdatasets/articles/data.html) is not an R package. But it is an excellent GitHub repo and a "collection of datasets originally distributed in various R packages" [@arel-bundock2024].
```{r}
#| echo: false
#| message: false
#| warning: false
library(readr)
library(stringr)
# Get filename
filename <- list.files(
path = "data",
pattern = "RDatasets"
)
# Extract the date
.RDatasets_date <- str_extract(filename, "\\d{4}-\\d{2}-\\d{2}")
# Get the data
.RDatasets <- read_csv2("data/RDatasets-2024-11-11.csv") %>%
select(
Package,
Item,
Title,
Rows,
Cols,
n_binary,
n_character,
n_factor,
n_logical,
n_numeric
)
# Get the row count
.RDatasets_rowcount <- .RDatasets %>%
tally()
```
Listed here are the `r .RDatasets_rowcount` data sets that were available on `r .RDatasets_date`.
```{r}
#| echo: false
#| message: false
#| warning: false
#| results: asis
.RDatasets %>%
paged_table()
```
The RDatasets repo contains that same list. But there you will also find a .csv file and documentation for each data set.
If I had to choose one fun data set from the list to highlight, it would be **starwars** from the **dplyr** [@wickham2023c] package.
You can choose to use the .csv file provided on the website. Another way to use the collection is to choose a data set from the list and load the package it comes with:
```{r}
library(dplyr)
data(starwars)
```
Let's take a quick look at what some of the starwars data looks like.
```{r}
#| eval: false
head(starwars)
```
```{r}
#| echo: false
#| message: false
#| results: asis
head(starwars) %>%
paged_table()
```
There are a bunch of Star Wars characters and their stats.
Let's choose two columns, **height** and **species** (and filter for six of the more well-known species). We'll use them to create a simple visualization with ggplot2.
```{r}
#| message: false
#| warning: false
#| results: asis
library(ggplot2)
starwars %>%
filter(species %in% c("Droid", "Ewok", "Gungan", "Human", "Hutt", "Wookiee")) %>%
ggplot(aes(height, species)) +
geom_boxplot() +
theme_bw()
```
This concludes the section about the different data sets available for every R user. Next, we'll take a look at some of the ggplot2 extensions that make it easier to do exploratory data analysis (EDA).
## Exploratory data analysis (EDA) {#sec-exploratory-data-analysis-eda}
Exploratory data analysis (or EDA, which we'll be using from now on) is a process, even if it is a loose one. The Cambridge Dictionary [@mcintosh2013] defines process as "a series of actions that you take in order to achieve a result". So we have a) a series of actions and b) a result we wish to achieve.
Let's look at the results first. After all, that is why we do things. You might have other, more specific goals depending on your particular field or use case. But in the most general sense, the result we're after is a better understanding of the data (set) we're working on.
What is the series of actions we need to take? As I mentioned earlier, EDA is a loose process. There are as many ways to go about it as there are analysts and data sets. Still, some common steps usually occur: looking at **missing values**, **summarizing data**, and **visualizing relationships** between variables.
It's good to note these visualizations aren't usually meant for publication. Compared to those you find later on in the book, these are more for your eyes than for the eventual audience.
The last thing we'll look at in this chapter is one way to **automate** the EDA process using an app. I must warn you, though: it's better to use these tools only after you've gained experience from doing EDA without them. It might sound counter-intuitive, but trust me. It can be too overwhelming if you don't know what you're doing.
### Missing values {#sec-missing-values}
Let's first load the data. Looking at the movies data set earlier, we noticed the **mpaa** column had many blank values. We don't know whether those movies never had a rating in the first place or whether they did and the rating is simply missing. For the sake of this demonstration, let's assume they all should have a rating.
We'll begin by turning all the blank values (of the *character* type) into *NA* (not available).
```{r}
#| message: false
#| warning: false
library(dplyr)
library(ggplot2movies)
movies_na <- movies %>%
mutate(
# Turn all blank values of the character type into NA
across(where(is.character), ~na_if(., "")),
# Create a decade column for grouping based on the values in the year column
decade = floor(year / 10) * 10
)
```
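To make the two transformations concrete, here is a toy sketch in base R. The values are made up, and the plain assignment is only meant to mirror what `na_if()` does in this simple case.

```r
# Toy values standing in for the mpaa and year columns
mpaa <- c("R", "", "PG-13", "")
year <- c(1893, 1999, 2005, 1950)

# Blank strings become NA, everything else stays as it is
mpaa[mpaa == ""] <- NA
mpaa

# Dividing by ten, rounding down, and multiplying back gives the decade
decade <- floor(year / 10) * 10
decade
```

Every year is mapped to the first year of its decade, which gives us a handy grouping variable with far fewer distinct values than year itself.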
**naniar** [@tierney2024] is a package with many functions for visualizing missing (NA) values. It also contains many functions beyond visualization, but that's for another book.
One simple function is `gg_miss_var()`. It creates a lollipop plot (\[INSERT LINK HERE\]). It shows which columns (variables) contain the greatest amount of missing (NA) values.
```{r}
library(naniar)
movies_na %>%
gg_miss_var() +
theme_bw()
```
We can see that *MPAA ratings* aren't the only thing missing. Almost the same number of films are missing the **budget** information. With budget, it's easier to say that if we don't have a number, it is missing. Then again, that column did have NAs in place from the beginning.
We're also interested in seeing if there is overlapping *missingness* between the columns. It can indicate patterns in the data. We'll use an upset plot (\[INSERT LINK HERE\]) for that. Just add `gg_miss_upset()`.
```{r}
movies_na %>%
gg_miss_upset(
# Number of sets to look at. We know there are only two columns with NA
nsets = 2
)
```
More than 3000 movies are missing only the MPAA rating, and almost as many are missing only the budget. But over 50000 are missing both. That makes me think there is a pattern to the missingness.
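The counts behind an upset plot can also be checked with a plain cross-tabulation. This is a minimal base R sketch on a made-up data frame (`toy` is hypothetical), not the actual movies data.

```r
# Made-up data with the same two columns that contain NAs
toy <- data.frame(
  mpaa   = c("R", NA, NA, "PG", NA, NA),
  budget = c(1e6, NA, 5e5, NA, NA, NA)
)

# Cross-tabulate the missingness of the two columns
missing_counts <- table(
  mpaa_missing   = is.na(toy$mpaa),
  budget_missing = is.na(toy$budget)
)
missing_counts

# Rows missing both values sit in the TRUE/TRUE cell
missing_counts["TRUE", "TRUE"]
```

Each cell of the table corresponds to one combination of missingness, so the same idea applied to `movies_na` should give numbers comparable to the bars in the plot.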
We can also use `geom_miss_point()` to see if there are more patterns between the missing variables. Let's also use `label_number()` from the **scales** [@wickham2023e] package to prettify the x-axis labels.
```{r}
#| message: false
#| warning: false
library(ggplot2)
library(scales)
p1 <- movies_na %>%
ggplot(aes(budget, mpaa)) +
geom_point() +
geom_miss_point() +
scale_x_continuous(
labels = label_number(
scale = 1e-6,
prefix = "$",
suffix = "M"
)
) +
theme_bw()
```
```{r}
#| echo: false
#| message: false
#| warning: false
#| results: asis
p1
```
Values seem to be missing in all the MPAA rating categories. NC-17 does not seem to contain that many values in general, but many of them seem to be missing.
Let's confirm this observation by creating a frequency table with the `tabyl()` function. It's from a neat package called **janitor** [@firke2023].
```{r}
#| message: false
#| warning: false
#| results: asis
library(dplyr)
library(janitor)
movies_na %>%
filter(mpaa == "NC-17") %>%
tabyl(mpaa, budget) %>%
paged_table()
```
So, there are more NC-17 movies that are missing the budget than those that aren't.
We can dig even deeper. We can use `facet_wrap()` from ggplot2 to see how the missing values are distributed across the decades.
```{r}
#| message: false
#| warning: false
#| results: asis
p1 +
scale_x_continuous(
n.breaks = 3,
labels = label_number(
scale = 1e-6,
prefix = "$",
suffix = "M"
)
) +
facet_wrap(vars(decade))
```
Before moving on, let's first look at one alternative for visualizing missing values, `plot_missing()`. It comes from the **DataExplorer** [@cui2024] package.
```{r}
#| message: false
#| warning: false
#| results: asis
library(DataExplorer)
movies_na %>%
plot_missing(
# Let's only visualize the missing values
missing_only = TRUE,
# We'll use theme_bw() when possible
ggtheme = theme_bw()
)
```
You see the number and percentage of NAs per column, arranged in order of missingness. This is a great quick overview!
DataExplorer, like naniar earlier, provides functions for handling the missing values. For instance, we could drop the budget and mpaa columns, as indicated by the red color in the plot above.
### Summarizing data {#sec-summarizing-data}
Of course, we want to visualize more than the missing data. It makes sense to start with an overview of some kind.
Luckily we don't have to go further than the DataExplorer package. You know, the one we already tried in the previous section.
We'll start with the `plot_intro()` function. It gives us some basic information about our data set, visualized.
```{r}
#| message: false
#| warning: false
#| results: asis
movies_na %>%
plot_intro(ggtheme = theme_bw())
```
**Missing Values**
The most fascinating insights here are still related to missing values. Less than 5% of the rows in the data set are complete, meaning that they have no missing values in any of the columns. And almost 10% of all the observations are missing. That's how much the missing values in budget and mpaa affect the totals. One positive thing is that we don't have any columns that are missing all the values.
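The two percentages can be approximated by hand. The sketch below uses a small made-up data frame and base R only. That these calculations match what `plot_intro()` reports as complete rows and missing observations is my assumption, not something taken from its source code.

```r
# Made-up data: two columns with NAs, one without
toy <- data.frame(
  budget = c(1e6, NA, 5e5, NA),
  mpaa   = c("R", NA, NA, "PG"),
  length = c(90, 120, 100, 85)
)

# Share of rows with no missing value in any column
complete_share <- mean(complete.cases(toy))
complete_share

# Share of all individual cells (observations) that are missing
missing_share <- mean(is.na(toy))
missing_share
```

A couple of sparse columns are enough to make most rows incomplete, even when the overall share of missing cells stays fairly low.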
**Categorical Columns**
Let's then take a look at the categorical columns. The one that will be missing from the upcoming visualizations is **title**. The values are more or less unique. So, it wouldn't make sense to have a bar chart, for instance, with more than 50000 rows with a count of 1 each.
Did someone mention bar charts? Let's start with the `plot_bar()` function to visualize the categorical columns. This will show us the frequency of each value within a column.
```{r}
#| message: false
#| warning: false
#| results: asis
movies_na %>%
plot_bar(ggtheme = theme_bw())
```
We see mpaa with its missing values. The rest of the categorical columns are binary ones about which genres each movie belongs to. **Comedy** and **Drama** seem to be the best represented ones in the data set.
With the *with* argument we can choose to show something else on the x-axis besides the frequency. Here, for instance, we've chosen **length** to be the column to sum up.
```{r}
#| message: false
#| warning: false
#| results: asis
movies_na %>%
plot_bar(
# Name of continuous feature to be summed instead of NULL (i.e. frequency)
with = "length",
ggtheme = theme_bw()
)
```
Here we can see that **Animation** and **Short** make up an even smaller part of the whole than before. Which makes sense, when you start to think about it.
Another useful feature is using the *by* argument. It means we will see the frequency broken down by a discrete feature. In this case we've chosen mpaa.
```{r}
#| message: false
#| warning: false
#| results: asis
movies_na %>%
plot_bar(
by = "mpaa",
ggtheme = theme_bw()
)
```
This shows us that **Action** and **Romance** are the genres with more R-rated movies than the others. Of course we must remember the prevalence of those missing values.
**Numerical Columns**
Next, we'll use `plot_histogram()` to draw a histogram of all the numerical columns (both discrete and continuous).
```{r}
#| message: false
#| warning: false
#| results: asis
movies_na %>%
plot_histogram(ggtheme = theme_bw())
```
That's a lot of information to take in all at once. For the next visualizations, we'll reduce the number of columns to look at.
**r1** to **r10** give the percentile of users who rated the movie that number. Interesting information in its own right. But we'll concentrate on **budget**, **length**, **rating**, **votes**, and **year**. At least for the rest of this chapter.
There is one more thing we notice when looking at the histogram for length. Most of the values seem to be close to zero. But there are some extreme values, as high as around 5000, that are skewing the view. Let's see what the cause is by looking at the title and length of the ten longest movies in the data set.
```{r}
movies_na %>%
select(title, length) %>%
slice_max(length, n = 10) %>%
paged_table()
```
So, a couple of cult classics. The Cure for Insomnia (5220 minutes) and The Longest Most Meaningless Movie in the World (2880 minutes) are in a category of their own. They are legit long films and not a mistake. Let's still create a new tibble that includes only those movies that have a length of under 500 minutes.
Also, we'll select the columns mentioned earlier: budget, length, rating, votes, and year.
```{r}
#| message: false
#| warning: false
#| results: asis
library(dplyr)
movies_na_length_under_500 <- movies_na %>%
select(budget, length, rating, votes, year) %>%
filter(length < 500)
```
Let's see what the four columns, length, rating, votes, and year, look like as a box plot when grouped by budget. Budget is a continuous variable. That's why `plot_boxplot()` groups the values into five equal ranges.
```{r}
#| message: false
#| warning: false
#| results: asis
movies_na_length_under_500 %>%
plot_boxplot(
by = "budget",
ggtheme = theme_bw(),
ncol = 4
)
```
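To get a feel for what grouping a continuous variable into five equal ranges means, here is a sketch using `cut()` from base R on made-up budget values. It illustrates equal-width binning in general. DataExplorer's own implementation may differ in its details.

```r
# Made-up budgets between 0 and 100 million dollars
budget <- c(0, 25, 50, 75, 100) * 1e6

# Split the range of budget into five intervals of equal width
budget_group <- cut(budget, breaks = 5)

# How many values fall into each interval?
table(budget_group)
```

Because the intervals have equal width rather than equal counts, a skewed variable like budget will typically put most movies into the lowest group. That is worth remembering when comparing the boxes.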
Based on the box plot, the movies with a bigger budget tend to:
- have ratings inside a narrower range
- have better ratings
- have more votes
- be more recent
- be longer
Let's then take a look at what this all looks like as a scatter plot. We'll use all the same columns. It's important to use an *alpha* value that is low enough to reveal a more nuanced picture than we would see if it was 1.
```{r}
#| message: false
#| warning: false
#| results: asis
movies_na_length_under_500 %>%
plot_scatterplot(
by = "budget",
geom_point_args = list(alpha = 0.05),
ggtheme = theme_bw(),
ncol = 2
)
```
These are some of the ways to use DataExplorer to create a visual summary of a data set. I encourage you to dig into the documentation and see what else there is.
To conclude, DataExplorer has a function, `create_report()`. It creates a full EDA report from scratch. We won't use that now, because the report would take too much space. But if you're using DataExplorer, you should try it!
**gt & gtExtras**
DataExplorer is a good package, but it's not the only game in town for EDA. **gt** [@iannone2024] is a package to create tables (\[INSERT LINK HERE\]) with.
Often, when you think about tables, you think about numeric and text values. But that doesn't have to be the case. With **gtExtras** [@mock2023] you can create a fast and easy visual summary of your data. The `gt_plt_summary()` function is all you need.
The function does have some limitations. It has had some issues with certain data sets. The movies data set didn't work, for instance. That's why we'll take a look at the WorldPhones data set instead.
```{r}
#| message: false
#| warning: false
#| results: asis
library(gt)
library(gtExtras)
world_phones_tbl %>%
gt_plt_summary()
```
What we get is a table that combines a visualized plot overview with basic summary elements: *missing*, *mean*, *median*, and *standard deviation (sd)*. You can also see from the icon on the left whether the column is categorical or continuous. With the categorical columns, you can see a list of the categorical values by clicking the arrow.
### Visualizing relationships {#sec-visualizing-relationships}
Most of the summarization techniques we’ve used so far have dealt with the distribution of values within one variable. But it’s also important to understand the possible relationships between different variables.
Understanding those relationships can provide insights into patterns, trends, and dependencies.
Two tools are available for this. A **correlation matrix** quantifies the relationships numerically. But since this book is about visualizing data, we’ll also visualize those numbers.
A **pairwise plot matrix** is a complementary visual tool to explore those relationships more deeply.
#### Correlation matrix {#sec-correlation-matrix}
First, let’s tweak the data set to better fit our demo purposes. We’ll *select* the columns we wish to use. In this case, it’s all the columns with numeric values, except for the ones about the ratings percentiles (r1-r10). And we’ll use `drop_na()` from tidyr to drop rows where any column has missing values.
```{r}
#| message: false
#| warning: false
#| results: asis
library(dplyr)
library(tidyr)
movies_without_na_numeric <- movies_na %>%
select(where(is.numeric) & !num_range("r", 1:10)) %>%
# The second select() is to order the columns alphabetically.
# Except for the names of genres that follow everything else.
select(budget, length, rating, votes, year, everything()) %>%
drop_na()
```
```{r}
#| message: false
#| warning: false
#| results: asis
movies_without_na_numeric %>%
paged_table()
```
```{r}
#| echo: false
#| message: false
#| warning: false
# Get the number of columns
.movies_without_na_numeric_ncols <- movies_without_na_numeric %>%
ncol()
# Get the row count
.movies_without_na_numeric_rowcount <- movies_without_na_numeric %>%
tally()
```
We're left with `r .movies_without_na_numeric_rowcount` rows (observations) and `r .movies_without_na_numeric_ncols` columns (variables). Still plenty enough to perceive exciting correlations between the columns.
Let's first create a correlation matrix using the `cor()` function. It's from base R and computes the correlation between the different columns. We're using the **Pearson** correlation coefficient as the method (default). Remember that it only measures linear association, so it can miss a non-linear relationship between variables.
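To see why a Pearson coefficient can miss a relationship, consider this small sketch with made-up vectors. Both `y_linear` and `y_curved` are fully determined by `x`, but only one of the relationships is linear.

```r
x <- -10:10

# A perfectly linear relationship
y_linear <- 2 * x + 1

# A perfectly predictable but non-linear (U-shaped) relationship
y_curved <- x^2

r_linear <- cor(x, y_linear)
r_curved <- cor(x, y_curved)

r_linear
r_curved
```

The first coefficient is as high as it gets, while the second is effectively zero. A value near zero therefore means no *linear* association, not necessarily no association at all.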
Due to the lack of space on the page, we'll only look at these columns: budget, length, rating, votes, and year. If there was more room, we'd include the genre columns. But we'll see them again soon enough!
```{r}
#| message: false
#| warning: false
#| results: asis
corr <- movies_without_na_numeric %>%
cor() %>%
round(2)
```
```{r}
#| message: false
#| warning: false
#| results: asis
library(dplyr)
corr %>%
as.data.frame() %>%
select(budget, length, rating, votes, year) %>%
# We're using slice_head to filter only the first n rows. 5 in this case.
# Without it, we would also have rows with the genres, ruining our matrix.
slice_head(n = 5) %>%
paged_table()
```
Of course, we can see the differences by looking at the values in the matrix. But it might be easier to get the bigger picture by looking at them graphically. For that, we'll use **ggcorrplot** [@kassambra2023].
This time, we'll also include the genre columns. We're interested in how they correlate with each other and the original five.
In its basic form `ggcorrplot()` only needs a correlation matrix. So let's first look at it without any modifications.
```{r}
#| message: false
#| warning: false
#| results: asis
library(ggcorrplot)
corr %>%
ggcorrplot()
```
So, the way to interpret this is simple. The redder the square, the stronger the positive correlation between the two variables. Votes and budget, for instance. Which makes intuitive sense. The bigger the movie (budget), the more likely it is to garner attention from IMDb users. If we look at budget and rating, there is no discernible correlation between the two. This is an intriguing initial finding and we could investigate it further.
Conversely, the bluer the square, the stronger the negative correlation between the two. The most obvious example here is the negative correlation between length and Short. Who would have guessed that the films in the Short genre would also be short in length? We can also see a negative correlation between the Short genre and budget, which again makes perfect sense. You can see the reverse effect, a positive correlation, when looking at budget and length. It strengthens the case further.
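Reading colors off a plot gets harder as the matrix grows, so it can help to list the strongest pairs as well. This is a sketch in base R that uses the built-in **mtcars** data instead of the movies data. The names `pairs_tbl`, `var1`, `var2`, and `r` are only illustrative.

```r
# Correlation matrix for a handful of mtcars columns
corr_toy <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])

# Keep each pair once by blanking the diagonal and the lower triangle
corr_toy[lower.tri(corr_toy, diag = TRUE)] <- NA

# Turn the matrix into one row per pair and drop the blanked cells
pairs_tbl <- as.data.frame(as.table(corr_toy))
names(pairs_tbl) <- c("var1", "var2", "r")
pairs_tbl <- pairs_tbl[!is.na(pairs_tbl$r), ]

# Sort by the absolute strength of the correlation, strongest first
pairs_tbl <- pairs_tbl[order(-abs(pairs_tbl$r)), ]
head(pairs_tbl, 3)
```

Sorting by the absolute value keeps strong negative correlations at the top alongside strong positive ones, which matches how we read the darkest squares of either color in the plot.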
We're still in the territory where most visualizations are for your eyes only. But there are some modifications we could make to the plot. Here are the parameters for `ggcorrplot()` we're using (and why):
- *Circle* is the other **method** available. The bigger the shape, the bigger the correlation (negative or positive). Double encoding (also known as redundant encoding) is a double-edged sword. We should use it with caution. But here it can help distinguish the most noteworthy cases
- We can use one of the ggplot2 themes by using **ggtheme**. Using a clean theme like *theme_bw* makes the plot a bit easier to read
- All variables are perfectly correlated with themselves. It doesn't make sense to show the diagonal. If anything, it makes reading the plot a bit harder. We can get rid of the diagonal by setting **show.diag** to *FALSE*
- Blue, white, and red are all fine **colors**. For a little more flavor, we can choose something else. Like *dark turquoise*, *grey*, and *dark orange*
```{r}
#| message: false
#| warning: false
#| results: asis
corr %>%
ggcorrplot(
method = "circle",
ggtheme = theme_bw,
show.diag = FALSE,
colors = c("darkturquoise", "grey95", "darkorange3")
)
```
Much better! And here are some more genre-related insights based on the matrix:
- Drama has a fairly strong positive correlation with both length and rating
- Action films have the strongest positive correlation with budget
- Comedy and Romance have a positive correlation
- But so do Romance and Drama
#### Pairwise plot matrix {#sec-pairwise-plot-matrix}
We already got a lot of information out of the correlation matrix. But as mentioned earlier, a pairwise plot matrix is a tool to explore the relationships within and between variables more deeply.
Again, we begin by fine-tuning the data set. In this case we're only interested in the Drama genre and how those films compare to the rest of the films.
```{r}
#| message: false
#| warning: false
#| results: asis
movies_without_na_drama <- movies_na %>%
mutate(
Drama = if_else(Drama == 1, "Yes", "No") %>%
as.factor()
) %>%
drop_na() %>%
select(budget, length, rating, votes, year, Drama)
```
```{r}
#| message: false
#| warning: false
#| results: asis
movies_without_na_drama %>%
paged_table()
```
To use a pairwise plot matrix we turn to **GGally** [@schloerke2024].
Let's start with the basic `ggpairs()` function. We only need to tell it which columns we want to use. We'll start by looking at the dataset without the Drama column.
```{r}
#| message: false
#| warning: false
#| results: asis
library(GGally)
movies_without_na_drama %>%
ggpairs(
columns = c("budget", "length", "rating", "votes", "year")
) +
theme_bw()
```
The matrix consists of three areas: **upper**, **lower**, and *diagonal* (or **diag** when it comes to function parameters). They can contain different types of plots and are customizable. But let's first look at the defaults:
- Upper contains the correlation coefficient between the two variables
- Lower contains a scatter plot of the two variables
- Diagonal contains a density plot of the variable in question
It's time to add Drama as color for the matrix. And also do some customizations:
- We know Drama is either "No" or "Yes", so let's pick two colors that go well together, *dark orange* and *dark turquoise*
- We can use *functions* to customize any part of the matrix
    - `lower_function()` creates a scatter plot where the points are somewhat transparent (alpha = 0.5). There is also a red linear trend line
    - `diag_function()` creates a density plot. But why do we need to do that when the default already does the same? For some reason, the default plot keeps the default colors, even if we map the colors as we have. But our function here does the trick
- Then we simply create the matrix
```{r}
#| message: false
#| warning: false
#| results: asis
library(ggplot2)
# 1) Choose your colors
manual_colors <- c("darkorange3", "darkturquoise")
# 2) Create functions for the lower half and the diagonal of the matrix
lower_function <- function(data, mapping) {
ggplot(data = data, mapping = mapping) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "red")
}
diag_function <- function(data, mapping) {
ggplot(data = data, mapping = mapping) +
geom_density()
}
# 3) Add colors. Then customize upper and lower half, and the diagonal
# of the matrix. Lower and diagonal use the functions we created in step 2
movies_without_na_drama %>%
ggpairs(
mapping = aes(color = Drama),
columns = c("budget", "length", "rating", "votes", "year"),
upper = list(continuous = wrap("cor", size = 3.5)),
lower = list(continuous = lower_function),
diag = list(continuous = diag_function)
) +
scale_color_manual(values = manual_colors) +
theme_bw()
```
There are many differences throughout the data depending on whether the movie belongs to the Drama genre.
If you want to, you can do the same with other genres!
There are many other ways to customize a pairwise plot matrix. For the rest, see the `ggpairs()` [documentation](https://ggobi.github.io/ggally/articles/ggpairs.html).
### Automated EDA app {#sec-automated-eda-app}