
# Data {#sec-data}

```{r}
#| echo:    false
#| message: false
#| results: asis

source("ready.R")
```

```{r}
#| echo: false
#| message: false
#| warning: false

library(conflicted)
conflicts_prefer(base::`+`)
conflicts_prefer(dplyr::filter)
conflicts_prefer(radiant.basics::compare_means)
conflicts_prefer(rlang::set_names)
conflicts_prefer(shiny::dataTableOutput)
conflicts_prefer(shiny::renderDataTable)
```

## Data sets {#sec-data-sets}

This book is about tools for creating visualizations. But to visualize data you first need data. So let's start by taking a look at some of the data sets available to us without much hassle.

### base R {#sec-base-r}

Base R comes with a bunch of data sets ready to use. There are classics like **iris** and **mtcars**, but there are many more to choose from.

```{r}
#| echo:    false
#| message: false
#| warning: false
#| results: asis

library(dplyr)
library(purrr)
library(rmarkdown)

data() %>%
  pluck(3) %>%
  as_tibble() %>%
  filter(Package == "datasets") %>%
  select(Item, Title) %>%
  arrange(Item, .locale = "en") %>% 
  paged_table()
```

Since the **datasets** package comes from base R, the data is not always immediately ready to use with **ggplot2** [@wickham2024a]. Luckily we have the **tidyverse** [@wickham2023b] packages that make it easy to make the necessary changes.

Here is an example using the **WorldPhones** ('The World's Telephones') data set. We can start by loading the data set by using the `data()` function.

```{r}
data(WorldPhones)
```

Let's take a quick look at the first few rows of the data set.

```{r}
head(WorldPhones)
```

WorldPhones is a matrix with 7 rows and 7 columns. The columns give the figures for a given region, and the rows the figures for a year. We would like to turn it into a *tidy* format. We can use the **tibble** [@müller2023] package for the first part. Then we'll use `pivot_longer()` from the **tidyr** package [@wickham2024b]. It increases the number of rows and decreases the number of columns. We want each continent to be a value in a column, not a column of its own.

```{r}
library(dplyr)
library(tibble)
library(tidyr)

world_phones_tbl <- WorldPhones %>%
  as.data.frame() %>%
  rownames_to_column(var = "Year") %>%
  as_tibble() %>%
  pivot_longer(
    cols      = !Year,
    names_to  = "Continent",
    values_to = "Phones"
  ) %>% 
  mutate(across(where(is.character), as.factor))
```

```{r}
#| echo:    false
#| message: false
#| warning: false
#| results: asis

world_phones_tbl %>%
  paged_table()
```

What we're left with is a tibble with three columns, **Year**, **Continent**, and **Phones**. We can then use the new tibble to create a simple graph with ggplot2.
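For comparison, the same reshaping can be sketched in base R alone. This is an alternative to the tidyverse pipeline, not a replacement for it, and the name `world_phones_base` is only used for illustration. The idea is that a matrix can be treated as a two-way table, and calling `as.data.frame()` on a table returns one row per cell.

```{r}
# Treat the matrix as a two-way table, then unfold it into one row per cell
world_phones_base <- as.data.frame(
  as.table(WorldPhones),
  responseName = "Phones"
)

# The two dimension columns get generic names by default, so rename them
names(world_phones_base)[1:2] <- c("Year", "Continent")

dim(WorldPhones)        # shape of the original matrix
dim(world_phones_base)  # shape of the long version
head(world_phones_base, 3)
```

The long version has one row for every year and region pair, so its row count equals the number of cells in the original matrix.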

```{r}
#| message: false
#| warning: false
#| results: asis

library(ggplot2)

world_phones_tbl %>%
  ggplot(aes(Year, Phones, color = Continent, group = Continent)) +
  geom_line() +
  theme_bw()
```

### IMDb movies (1893-2005) {#sec-imdb-movies-1893-2005}

**ggplot2movies** [@wickham2015] used to be a part of the ggplot2 package itself. It’s now its own package to make ggplot2 lighter.

But it’s a cool little package. It has [Internet Movie Database (IMDb)](https://www.imdb.com/) data about movies from between 1893 and 2005. The selected movies have "a known length and had been rated by at least one \[IMDb\] user" [@wickham2015].

The **Movies** data set has qualities that make it good for our needs. Let’s start by loading it.

```{r}
#| message: false
#| warning: false
#| results: asis

library(ggplot2movies)

data(movies)
```

Let's take a quick look at what some of the data looks like.

```{r}
#| eval: false

head(movies)
```

```{r}
#| echo:    false
#| message: false
#| results: asis

head(movies) %>%
  paged_table()
```

```{r}
#| echo:    false
#| message: false
#| warning: false

# Get the number of columns
.movies_ncols <- movies %>% 
  ncol()

# Get the row count
.movies_rowcount <- movies %>% 
  tally()
```

Movies is already a tibble. It consists of `r .movies_rowcount` rows (observations) and `r .movies_ncols` columns (variables).

When starting to work with a new data set, it's always good to take a look at the documentation to understand what is in those rows and columns (and what is not).

```{r}
#| echo:    false
#| message: false
#| results: asis

library(tibble)

tribble(
  ~Variable,     ~Description,
  "title",       "Title of the movie",
  "year",        "Year of release",
  "budget",      "Total budget (if known) in US dollars",
  "length",      "Length in minutes",
  "rating",      "Average IMDB user rating",
  "votes",       "Number of IMDB users who rated this movie",
  "r1",          "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 1",
  "r2",          "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 2",
  "r3",          "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 3",
  "r4",          "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 4",
  "r5",          "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 5",
  "r6",          "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 6",
  "r7",          "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 7",
  "r8",          "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 8",
  "r9",          "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 9",
  "r10",         "Multiplying by ten gives percentile (to nearest 10%) of users who rated this movie a 10",
  "mpaa",        "MPAA rating",
  "Action",      "Binary variable representing if movie was classified as belonging to that genre",
  "Animation",   "Binary variable representing if movie was classified as belonging to that genre",
  "Comedy",      "Binary variable representing if movie was classified as belonging to that genre",
  "Drama",       "Binary variable representing if movie was classified as belonging to that genre",
  "Documentary", "Binary variable representing if movie was classified as belonging to that genre",
  "Romance",     "Binary variable representing if movie was classified as belonging to that genre",
  "Short",       "Binary variable representing if movie was classified as belonging to that genre"
) %>%
  paged_table()
```

Movies is a good example data set because it includes:

-   A goldilocks amount of data. Not too little, not too much
-   *Categorical* data of both *nominal* (title, genre) and *ordinal* (mpaa) kind
-   *Numerical* data of both *continuous* (budget, length, rating) and *discrete* (year, votes) kind
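As a rough check, here is a sketch that lists how R stores a handful of these columns. Keep in mind that the storage class is not the same thing as the measurement level: an ordinal variable can be stored as plain text, and a conceptually continuous one as whole numbers. The name `movies_col_classes` is only used for illustration.

```{r}
library(ggplot2movies)

# First class of each column, returned as a named character vector
movies_col_classes <- vapply(
  movies,
  function(col) class(col)[1],
  character(1)
)

# Look at the columns mentioned in the list above
movies_col_classes[c("title", "mpaa", "budget", "length", "rating", "year", "votes")]
```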

We can use two of those columns, **year** and **rating**, to create a simple visualization with ggplot2.

```{r}
#| message: false
#| warning: false
#| results: asis

library(ggplot2)

movies %>%
  ggplot(aes(year, rating)) +
  geom_point(alpha = 0.05) +
  theme_bw()
```

As mentioned earlier, Movies is already a tibble. That doesn't mean the data is in an optimal format for every kind of visualization. We'll do the necessary data wrangling within the chapter where we use the data.

### RDatasets {#sec-rdatasets}

[**RDatasets**](https://vincentarelbundock.github.io/Rdatasets/articles/data.html) is not an R package but an excellent GitHub repo: a "collection of data sets originally distributed in various R packages" [@arel-bundock2024].

```{r}
#| echo:    false
#| message: false
#| warning: false

library(readr)
library(stringr)

# Get filename
filename <- list.files(
  path    = "data",
  pattern = "RDatasets"
)

# Extract the date
.RDatasets_date <- str_extract(filename, "\\d{4}-\\d{2}-\\d{2}")

# Get the data
.RDatasets <- read_csv2("data/RDatasets-2024-11-11.csv") %>%
  select(
    Package,
    Item,
    Title,
    Rows,
    Cols,
    n_binary,
    n_character,
    n_factor,
    n_logical,
    n_numeric
  )

# Get the row count
.RDatasets_rowcount <- .RDatasets %>% 
  tally()
```

Listed here are the `r .RDatasets_rowcount` data sets that were available on `r .RDatasets_date`.

```{r}
#| echo:    false
#| message: false
#| warning: false
#| results: asis

.RDatasets %>% 
  paged_table()
```

The RDatasets repo contains that same list. But there you will also find a .csv file and documentation for each data set.

If I had to choose one fun data set from the list to highlight, it would be **starwars** from the **dplyr** [@wickham2023c] package.

You can choose to use the .csv file provided on the website. Another way to use the collection is to choose the data set from the list and load the package it comes with:

```{r}
library(dplyr)

data(starwars)
```
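The .csv route could look roughly like the sketch below. It assumes the site serves each file under `csv/<package>/<item>.csv`. That pattern is an assumption based on the site's layout, so check the link in the list before relying on it. The helper `rdatasets_csv_url()` is a hypothetical name, not part of any package.

```{r}
# Hypothetical helper: build the address of a data set's .csv file,
# assuming the csv/<package>/<item>.csv layout
rdatasets_csv_url <- function(package, item) {
  sprintf(
    "https://vincentarelbundock.github.io/Rdatasets/csv/%s/%s.csv",
    package,
    item
  )
}

starwars_csv_url <- rdatasets_csv_url("dplyr", "starwars")
starwars_csv_url

# Reading the file needs an internet connection, so the call is left commented out:
# starwars_csv <- read.csv(starwars_csv_url)
```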

Let's take a quick look at what some of the starwars data looks like.

```{r}
#| eval: false

head(starwars)
```

```{r}
#| echo:    false
#| message: false
#| results: asis

head(starwars) %>%
  paged_table()
```

There are a bunch of Star Wars characters and their stats.

Let's choose two columns, **height** and **species** (and filter for six of the more well-known species). We'll use them to create a simple visualization with ggplot2.

```{r}
#| message: false
#| warning: false
#| results: asis

library(ggplot2)

starwars %>%
  filter(species %in% c("Droid", "Ewok", "Gungan", "Human", "Hutt", "Wookiee")) %>%
  ggplot(aes(height, species)) +
  geom_boxplot() +
  theme_bw()
```

This concludes the section about the different data sets available for every R user. Next, we'll take a look at some of the ggplot2 extensions that make it easier to do exploratory data analysis (EDA).

## Exploratory data analysis (EDA) {#sec-exploratory-data-analysis-eda}

Exploratory data analysis (or EDA, which we'll be using from now on) is a process, even if it is a loose one. The Cambridge Dictionary [@mcintosh2013] defines process as "a series of actions that you take in order to achieve a result". So, we have a) a series of actions and b) a result we wish to achieve.

Let's look at the results first. After all, that is why we do things. You might have other, more specific goals depending on your particular field or use case. But in the most general sense, the result we're after is a better understanding of the data (set) we're working on.

What is the series of actions we need to take? As I mentioned earlier, EDA is a loose process. There are as many ways to go about it as there are analysts and data sets. Still, some common steps usually occur: looking at **missing values**, **summarizing data**, and **visualizing relationships** between variables.

It's good to note these visualizations aren't usually meant for publication. Compared to those you find later on in the book, these are more for your eyes than for the eventual audience.

The last thing we'll look at in this chapter is one way to **automate** the EDA process using an app. I must warn you, though: it's better to use these tools only after you've gained experience from doing EDA without them. It might sound counter-intuitive, but trust me. It can be too overwhelming if you don't know what you're doing.
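Before turning to dedicated packages, the three common steps can be sketched with base R alone. The example uses **airquality**, one of the data sets that ships with base R, purely as an illustration. The object names are made up for this sketch.

```{r}
# Missing values: count the NAs in each column
missing_per_column <- colSums(is.na(airquality))
missing_per_column

# Summarizing data: quartiles, mean, and NA count for each column
summary(airquality)

# Relationships: pairwise Pearson correlations, ignoring incomplete pairs
airquality_cor <- round(cor(airquality, use = "pairwise.complete.obs"), 2)
airquality_cor
```

The rest of this section covers the same three steps with visual tools instead of printed numbers.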

### Missing values {#sec-missing-values-eda}

Let's first load the data. Looking at the movies data set earlier, we noticed the **mpaa** column had many blank values. We don't know whether those movies never had a rating in the first place or whether they did and the rating is missing. For the sake of this demonstration, let's assume they all should have a rating.

We'll begin by turning all the blank values (of the *character* type) into *NA* (not available).

```{r}
#| message: false
#| warning: false

library(dplyr)
library(ggplot2movies)

movies_na <- movies %>%
  mutate(
    # Turn all blank values of the character type into NA
    across(where(is.character), ~na_if(., "")), 
    # Create a decade column for grouping based on the values in the year column
    decade = floor(year / 10) * 10
  )
```
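To see the two transformations in isolation, here is a small sketch on made-up values. The vectors `toy_mpaa` and `toy_years` are hypothetical and not taken from the movies data.

```{r}
library(dplyr)

# Hypothetical ratings, with an empty string wherever the rating is blank
toy_mpaa <- c("PG", "", "R", "")
toy_mpaa_na <- na_if(toy_mpaa, "")
toy_mpaa_na

# Hypothetical release years mapped to the start of their decade
toy_years <- c(1893, 1969, 1970, 2005)
toy_decades <- floor(toy_years / 10) * 10
toy_decades
```

`na_if()` only replaces exact matches, so a rating consisting of a single space would be left as it is.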

**naniar** [@tierney2024] is a package with many functions for visualizing missing (NA) values. It also contains many functions that go beyond visualization. But that's for another book.

One simple function is `gg_miss_var()`. It creates a lollipop plot (\[INSERT LINK HERE\]) that shows which columns (variables) contain the most missing (NA) values.

```{r}
library(naniar)

movies_na %>% 
  gg_miss_var() +
  theme_bw()
```

We can see that *MPAA ratings* aren't the only thing missing. Almost the same number of films are missing the **budget** information. With budget, it's easier to say that if we don't have a number, it is missing. Then again, that column did have NAs in place from the beginning.

We're also interested in seeing if there is overlapping *missingness* between the columns. It can indicate patterns in the data. We'll use an upset plot (\[INSERT LINK HERE\]) for that. Just add `gg_miss_upset()`.

```{r}
movies_na %>% 
  gg_miss_upset(
    # Number of sets to look at. We know there are only two columns with NA
    nsets = 2
  )
```

More than 3000 movies are missing only the MPAA rating, and almost as many are missing only the budget. But over 50000 are missing both. That makes me think there is a consistency in the missingness.
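The counts behind the upset plot can also be tabulated directly. This is a base R sketch on the raw movies data, where a missing MPAA rating is a blank string and a missing budget is NA. The object names are only for illustration.

```{r}
library(ggplot2movies)

# Flag the missing values in each of the two columns
mpaa_missing   <- is.na(movies$mpaa) | movies$mpaa == ""
budget_missing <- is.na(movies$budget)

# Cross-tabulate the flags: the TRUE/TRUE cell holds movies missing both
missing_overlap <- table(mpaa_missing, budget_missing)
missing_overlap
```

Each cell of the table corresponds to one combination of missing and present values, so the four cells add up to the number of rows in the data set.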

We can also use `geom_miss_point()` to see if there are more patterns between the missing variables. Let's also use `label_number()` from the **scales** [@wickham2023e] package to prettify the x-axis labels.

```{r}
#| message: false
#| warning: false

library(ggplot2)
library(scales)

p1 <- movies_na %>%
  ggplot(aes(budget, mpaa)) +
  geom_point() +
  geom_miss_point() +
  scale_x_continuous(
    labels = label_number(
      scale  = 1e-6,
      prefix = "$",
      suffix = "M"
    )
  ) +
  theme_bw()
```

```{r}
#| echo:    false
#| message: false
#| warning: false
#| results: asis

p1
```

Values seem to be missing in all the MPAA rating categories. NC-17 does not seem to contain that many values in general, but many of them seem to be missing.

Let's confirm this observation by creating a frequency table with the `tabyl()` function. It's from a neat package called **janitor** [@firke2023].

```{r}
#| message: false
#| warning: false
#| results: asis

library(dplyr)
library(janitor)

movies_na %>% 
  filter(mpaa == "NC-17") %>% 
  tabyl(mpaa, budget) %>%
  paged_table()
```

So, there are more NC-17 movies that are missing the budget than those that aren't.
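If only the two totals matter, the comparison can be reduced to a single logical flag. This is a base R sketch on the raw movies data rather than the janitor approach, and the object names are made up for the example.

```{r}
library(ggplot2movies)

# Budgets of the NC-17 movies only; which() skips any NA comparisons
nc17_budget <- movies$budget[which(movies$mpaa == "NC-17")]

# Count how many of those budgets are missing (TRUE) or known (FALSE)
nc17_budget_missing <- table(budget_missing = is.na(nc17_budget))
nc17_budget_missing
```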

We can dig even deeper. We can use `facet_wrap()` from ggplot2 to see how the missing values are distributed across the decades.

```{r}
#| message: false
#| warning: false
#| results: asis

p1 +
  scale_x_continuous(
    n.breaks = 3,
    labels   = label_number(
      scale  = 1e-6,
      prefix = "$",
      suffix = "M"
    )
  ) +
  facet_wrap(vars(decade))
```

Before moving on, let's first look at one alternative for visualizing missing values, `plot_missing()`. It comes from the **DataExplorer** [@cui2024] package.

```{r}
#| message: false
#| warning: false
#| results: asis

library(DataExplorer)

movies_na %>%
  plot_missing(
    # Let's only visualize the missing values
    missing_only = TRUE,
    # We'll use theme_bw() when possible
    ggtheme      = theme_bw()
  )
```

You see the number and percentage of NAs per column, arranged in order of missingness. This is a great quick overview!

DataExplorer, like naniar earlier, provides functions for handling the missing values. For instance, we could drop the budget and mpaa columns, as indicated by the red color in the plot above.
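As a sketch of what dropping columns could look like, here is `drop_columns()` from DataExplorer applied to a made-up data frame. The data frame `toy_movies` and its values are hypothetical. The assumption is that the function takes the data and a vector of column names, as described in the package documentation.

```{r}
library(DataExplorer)

# Hypothetical data frame where two columns are mostly missing
toy_movies <- data.frame(
  title  = c("A", "B", "C"),
  budget = c(NA, 1e6, NA),
  mpaa   = c(NA, "R", NA)
)

# Drop the two columns by name and keep the rest
toy_movies_trimmed <- drop_columns(toy_movies, c("budget", "mpaa"))
names(toy_movies_trimmed)
```

Whether dropping is the right call depends on the analysis. In this chapter we keep both columns because their missingness is what we're studying.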

### Summarizing data {#sec-summarizing-data}

Of course, we want to visualize more than the missing data. It makes sense to start with an overview of some kind.

Luckily we don't have to go further than the DataExplorer package. You know, the one we already tried in the previous section.

We'll start with the `plot_intro()` function. It gives us some basic information about our data set, visualized.

```{r}
#| message: false
#| warning: false
#| results: asis

library(DataExplorer)
library(dplyr)

movies_na %>%
  plot_intro(ggtheme = theme_bw())
```

#### Missing Values {#sec-missing-values-summarizing-data}

The most fascinating insights here are still related to missing values. Less than 5% of the rows in the data set are complete, meaning that they have no missing values in any of the columns. And almost 10% of all the observations are missing. That's how much the missing values in budget and mpaa are affecting the totals. One positive thing is that we don't have any columns that are missing all the values.
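The difference between complete rows and missing observations is easy to mix up, so here is a base R sketch on a made-up data frame. It mimics the two measures as I understand them. It is not DataExplorer's own code, and `toy_ratings` is hypothetical.

```{r}
# Hypothetical data: row B lacks a budget, row C lacks a rating
toy_ratings <- data.frame(
  title  = c("A", "B", "C", "D"),
  budget = c(1e6, NA, 5e5, 2e6),
  mpaa   = c("R", "PG", NA, "PG-13")
)

# Share of rows with no missing value in any column
complete_rows_share <- mean(complete.cases(toy_ratings))
complete_rows_share

# Share of all cells (rows times columns) that are missing
missing_cells_share <- mean(is.na(toy_ratings))
missing_cells_share
```

Two of the four rows are complete (50%), while only two of the twelve cells are missing (about 17%). A few missing values spread over many rows can push the share of complete rows down fast.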

#### Categorical Columns {#sec-categorical-columns}

Let's then take a look at the categorical columns. The one that will be missing from the upcoming visualizations is **title**. The values are more or less unique. So, it wouldn't make sense to have a bar chart, for instance, with more than 50000 rows with a count of 1 each.

Did someone mention bar charts? Let's start with the `plot_bar()` function to visualize the categorical columns. This will show us what the frequency of each value is within a column.

```{r}
#| message: false
#| warning: false
#| results: asis

movies_na %>% 
  plot_bar(ggtheme = theme_bw())
```

We see mpaa with its missing values. The rest of the categorical columns are binary ones about which genres each movie belongs to. **Comedy** and **Drama** seem to be the best represented ones in the data set.

With the *with* argument we can choose to show something else on the x-axis besides the frequency. Here, for instance, we've chosen **length** to be the column to sum up.

```{r}
#| message: false
#| warning: false
#| results: asis

movies_na %>% 
  plot_bar(
    # Name of continuous feature to be summed instead of NULL (i.e. frequency)
    with    = "length",
    ggtheme = theme_bw()
  )
```

Here we can see that **Animation** and **Short** make up an even smaller part of the whole than before. Which makes sense, when you start to think about it.

Another useful feature is using the *by* argument. It means we will see the frequency broken down by a discrete feature. In this case we've chosen mpaa.

```{r}
#| message: false
#| warning: false
#| results: asis

movies_na %>% 
  plot_bar(
    by      = "mpaa",
    ggtheme = theme_bw()
  )
```

This shows us that **Action** and **Romance** have a larger share of R-rated movies than the other genres. Of course we must remember the prevalence of those missing values.

#### Numerical Columns {#sec-numerical-columns}

Next, we'll use `plot_histogram()` to draw a histogram of all the numerical columns (both discrete and continuous).

```{r}
#| message: false
#| warning: false
#| results: asis

movies_na %>%
  plot_histogram(ggtheme = theme_bw())
```

That's a lot of information to take in all at once. For the next visualizations, we'll reduce the number of columns to look at.

**r1** to **r10** give the percentile of users who rated the movie that number. Interesting information in its own right. But we'll concentrate on budget, **length**, rating, **votes**, and year. At least for the rest of this chapter.

There is one more thing we notice when looking at the histogram for length. Most of the values seem to be close to zero. But there are some extreme values around even 5000 that are skewing the view. Let's see what the cause is by looking at the title and length of the ten longest movies in the data set.

```{r}
movies_na %>%
  select(title, length) %>%
  slice_max(length, n = 10) %>%
  paged_table()
```

So, a couple of cult classics. The Cure for Insomnia (5220 minutes) and The Longest Most Meaningless Movie in the World (2880 minutes) are in a category of their own. They are legit long films and not a mistake. Let's still create a new tibble that includes only those movies that have a length of under 500 minutes.

Also, we'll select the columns mentioned earlier: budget, length, rating, votes, and year.

```{r}
#| message: false
#| warning: false
#| results: asis

movies_na_length_under_500 <- movies_na %>%
  select(budget, length, rating, votes, year) %>%
  filter(length < 500)
```

Let's see what the four columns, length, rating, votes, and year, look like as box plots when grouped by budget. Budget is a continuous variable. That's why `plot_boxplot()` groups the values into five equal ranges.

```{r}
#| message: false
#| warning: false
#| results: asis

movies_na_length_under_500 %>%
  plot_boxplot(
    by      = "budget",
    ggtheme = theme_bw(),
    ncol    = 4
  )
```

Based on the box plot, the movies with a bigger budget tend to:

-   have ratings inside a narrower range
-   have better ratings
-   have more votes
-   be more recent
-   be longer
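The grouping of a continuous variable into equal ranges can be imitated in base R with `cut()`. This is only an illustration of the idea on made-up budgets, not necessarily how DataExplorer does it internally. The names `toy_budgets` and `toy_budget_ranges` are hypothetical.

```{r}
# Hypothetical budgets in US dollars
toy_budgets <- c(0, 5e6, 2e7, 4.5e7, 6e7, 8e7, 1e8)

# Split the range of the values into five intervals of equal width
toy_budget_ranges <- cut(toy_budgets, breaks = 5)

# Number of budgets that fall into each interval
table(toy_budget_ranges)
```

Equal-width ranges don't hold equal numbers of movies. With skewed data like budgets, most observations tend to land in the lowest range, which is worth remembering when comparing the boxes.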

Let's then take a look at what this all looks like as a scatter plot. We'll use all the same columns. It's important to use an *alpha* value that is low enough to reveal a more nuanced picture than we would see if it was 1.

```{r}
#| message: false
#| warning: false
#| results: asis

movies_na_length_under_500 %>%
  plot_scatterplot(
    by              = "budget",
    geom_point_args = list(alpha = 0.05),
    ggtheme         = theme_bw(),
    ncol            = 2
  )
```

These are some of the ways to use DataExplorer to create a visual summary of a data set. I encourage you to dig into the documentation and see what else there is.

To conclude, DataExplorer has a function, `create_report()`. It creates a full EDA report from scratch. We won't use that now, because the report would take too much space. But if you're using DataExplorer, you should try it!

#### gt & gtExtras {#sec-gt-gtextras}

DataExplorer is a good package, but it's not the only game in town for EDA. **gt** [@iannone2024] is a package to create tables (\[INSERT LINK HERE\]) with.

Often, when you think about tables, you think about numeric and text values. But that doesn't have to be the case. With **gtExtras** [@mock2023] you can create a fast and easy visual summary of your data. The `gt_plt_summary()` function is all you need.

The function does have some limitations. It has had some issues with certain data sets. The movies data set didn't work, for instance. That's why we'll take a look at the WorldPhones data set instead.

```{r}
#| message: false
#| warning: false
#| results: asis

library(gt)
library(gtExtras)

world_phones_tbl %>%
  gt_plt_summary()
```

What we get is a table that combines a visualized plot overview with basic summary elements: *missing*, *mean*, *median*, and *standard deviation (sd)*. You can also see from the icon on the left, whether the column is categorical or continuous. With the categorical columns, you can see a list of the categorical values by clicking the arrow.

### Visualizing relationships {#sec-visualizing-relationships}

Most of the summarization techniques we’ve used so far have dealt with the distribution of values within one variable. But it’s also important to understand the possible relationships between different variables.

Understanding those relationships can provide insights into patterns, trends, and dependencies.

Two tools are available for this. A **correlation matrix** quantifies the relationships numerically. But since this book is about visualizing data, we’ll also visualize those numbers.

A **pairwise plot matrix** is a complementary visual tool to explore those relationships more deeply.

#### Correlation matrix {#sec-correlation-matrix}

First, let’s tweak the data set to better fit our demo purposes. We’ll *select* the columns we wish to use. In this case, it’s all the columns with numeric values. Except for the ones about the ratings percentiles (r1-r10). And we’ll use `drop_na()` from tidyr to drop rows where any column has missing values.

```{r}
#| message: false
#| warning: false
#| results: asis

library(dplyr)
library(tidyr)

movies_without_na_numeric <- movies_na %>%
  select(where(is.numeric) & !num_range("r", 1:10)) %>%
  # The second select() is to order the columns alphabetically.
  # Except for the names of genres that follow everything else.
  select(budget, length, rating, votes, year, everything()) %>%
  drop_na()
```

```{r}
#| message: false
#| warning: false
#| results: asis

movies_without_na_numeric %>%
  paged_table()
```

```{r}
#| echo:    false
#| message: false
#| warning: false

# Get the number of columns
.movies_without_na_numeric_ncols <- movies_without_na_numeric %>% 
  ncol()

# Get the row count
.movies_without_na_numeric_rowcount <- movies_without_na_numeric %>% 
  tally()
```

We're left with `r .movies_without_na_numeric_rowcount` rows (observations) and `r .movies_without_na_numeric_ncols` columns (variables). Still plenty to perceive exciting correlations between the columns.

Let's first create a correlation matrix using the `cor()` function. It's from base R and computes the correlation between the different columns. We're using the **Pearson** correlation coefficient as the method (default). Remember it can't determine a potential non-linear association between variables.
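The limitation with non-linear associations is easiest to see on made-up numbers. In this sketch, `toy_y` is fully determined by `toy_x`, yet the relationship is U-shaped rather than linear. All names are hypothetical.

```{r}
# A perfectly predictable but non-linear (U-shaped) relationship
toy_x <- -10:10
toy_y <- toy_x^2
pearson_nonlinear <- cor(toy_x, toy_y)
pearson_nonlinear

# A linear relationship for comparison
pearson_linear <- cor(toy_x, 2 * toy_x + 1)
pearson_linear
```

The first coefficient comes out at practically zero even though knowing x tells you y exactly, while the linear case gives a coefficient of 1. A low Pearson value is therefore a reason to look at a scatter plot, not proof that two variables are unrelated.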

Due to the lack of space on the page, we'll only look at these columns: budget, length, rating, votes, and year. If there was more room, we'd include the genre columns. But we'll see them again soon enough!

```{r}
#| message: false
#| warning: false
#| results: asis

corr <- movies_without_na_numeric %>% 
  cor() %>%
  round(2)
```

```{r}
#| message: false
#| warning: false
#| results: asis

corr %>%
  as.data.frame() %>%
  select(budget, length, rating, votes, year) %>%
  # We're using slice_head to filter only the first n rows. 5 in this case.
  # Without it, we would also have rows with the genres, ruining our matrix.
  slice_head(n = 5) %>%
  paged_table()
```

Of course, we can see the differences by looking at the values in the matrix. But it might be easier to get the bigger picture by looking at them graphically. For that, we'll use **ggcorrplot** [@kassambara2023].

This time, we'll also include the genre columns. We're interested in how they correlate with each other and the original five.

In its basic form, `ggcorrplot()` only needs the correlation matrix as its first argument. So let's first look at it without any modifications.

```{r}
#| message: false
#| warning: false
#| results: asis

library(ggcorrplot)

corr %>%
  ggcorrplot()
```

So, the way to interpret this is simple. The redder the square, the stronger the positive correlation between the two variables. Votes and budget, for instance. Which makes intuitive sense. The bigger the movie (budget), the more likely it is to garner attention from IMDb users. If we look at budget and rating, there is no discernible correlation between the two. This is an intriguing initial finding and we could investigate it further.

Conversely, the bluer the square, the stronger the negative correlation between the two. The most obvious example here is the negative correlation between length and Short. Who would have guessed that the films in the Short genre would also be short in length? We can also see a negative correlation between the Short genre and budget. Which again makes perfect sense. You can also see the reverse effect when looking at budget and length. It strengthens the case further.

We're still in the territory where most visualizations are for your eyes only. But there are some modifications we could make to the plot. Here are the parameters for `ggcorrplot()` we're using (and why):

-   *Circle* is the other **method** available. The bigger the shape, the bigger the correlation (negative or positive). Double encoding (also known as redundant encoding) is a two-edged sword. We should use it with caution. But here it can help distinguish the most noteworthy cases
-   We can use one of the ggplot2 themes by using **ggtheme**. Using a clean theme like *theme_bw* makes the plot a bit easier to read
-   All variables are perfectly correlated with themselves. It doesn't make sense to show the diagonal. If anything, it makes reading the plot a bit harder. We can get rid of the diagonal by setting **show.diag** to *FALSE*
-   Blue, white, and red are all fine **colors**. For a little more flavor, we can choose something else. Like *dark turquoise*, *grey*, and *dark orange*

```{r}
#| message: false
#| warning: false
#| results: asis

corr %>% 
  ggcorrplot(
    method    = "circle",
    ggtheme   = theme_bw,
    show.diag = FALSE,
    colors    = c("darkturquoise", "grey95", "darkorange3")
  )
```

Much better! And here are some more genre-related insights based on the matrix:

-   Drama has a fairly strong positive correlation with both length and rating
-   Action films have the strongest positive correlation with budget
-   Comedy and Romance have a positive correlation
-   But so do Romance and Drama

#### Pairwise plot matrix {#sec-pairwise-plot-matrix}

We already got a lot of information out of the correlation matrix. But as mentioned earlier, a pairwise plot matrix is a tool to explore the relationships within and between variables more deeply.

Again, we begin by fine-tuning the data set. In this case we're only interested in the Drama genre and how those films compare to the rest of the films.

```{r}
#| message: false
#| warning: false
#| results: asis

movies_without_na_drama <- movies_na %>%
  mutate(
    Drama = if_else(Drama == 1, "Yes", "No") %>%
      as.factor()
  ) %>% 
  drop_na() %>%
  select(budget, length, rating, votes, year, Drama)
```

```{r}
#| message: false
#| warning: false
#| results: asis

movies_without_na_drama %>% 
  paged_table()
```

To use a pairwise plot matrix we turn to **GGally** [@schloerke2024].

Let's start with the basic `ggpairs()` function. We only need to tell it which columns we want to use. We'll start by looking at the data set without the Drama column.

```{r}
#| message: false
#| warning: false
#| results: asis

library(GGally)

movies_without_na_drama %>%
  ggpairs(
    columns = c("budget", "length", "rating", "votes", "year")
  ) +
  theme_bw()
```

The matrix consists of three areas: **upper**, **lower**, and *diagonal* (or **diag** when it comes to function parameters). They can contain different types of plots and are customizable. But let's first look at the defaults:

-   Upper contains the correlation coefficient between the two variables
-   Lower contains a scatter plot of the two variables
-   Diagonal contains a density plot of the variable in question

It's time to add Drama as color for the matrix. And also do some customizations:

-   We know Drama is either "No" or "Yes", so let's pick two colors that go well together, *dark orange* and *dark turquoise*
-   We can use *functions* to customize any part of the matrix
    -   `lower_function()` creates a scatter plot where the points are somewhat transparent (alpha = 0.5). There is also a red linear trend line
    -   `diag_function()` creates a density plot. But why do we need to do that when the default already does the same? For some reason, the default diagonal uses the default colors, even when we map the colors as we have. But our function here does the trick
-   Then we simply create the matrix

```{r}
#| message: false
#| warning: false
#| results: asis

library(ggplot2)

# 1) Choose your colors
manual_colors <- c("darkorange3", "darkturquoise")

# 2) Create functions for the lower half and the diagonal of the matrix
lower_function <- function(data, mapping) {
  ggplot(data = data, mapping = mapping) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", color = "red")
}

diag_function <- function(data, mapping) {
  ggplot(data = data, mapping = mapping) +
    geom_density()
}

# 3) Add colors. Then customize upper and lower half, and the diagonal
# of the matrix. Lower and diagonal use the functions we created in step 2
movies_without_na_drama %>%
  ggpairs(
    mapping = aes(color = Drama),
    columns = c("budget", "length", "rating", "votes", "year"),
    upper   = list(continuous = wrap("cor", size = 3.5)),
    lower   = list(continuous = lower_function),
    diag    = list(continuous = diag_function)
  ) +
  scale_color_manual(values = manual_colors) +
  theme_bw()
```

There are many differences throughout the data depending on whether the movie belongs to the Drama genre.

If you want to, you can do the same with other genres!
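
One way to avoid copying the whole call for each genre is to wrap it in a small function. The sketch below is only an illustration: `pairs_by_genre()` is a made-up helper, not part of GGally, and it assumes the genre column holds "No"/"Yes" values the way Drama does. Other genre columns would need the same preparation first.

```{r}
#| message: false
#| warning: false

library(ggplot2)
library(GGally)

# Hypothetical helper: the same matrix as above, for any "No"/"Yes" genre column
pairs_by_genre <- function(data, genre,
                           colors = c("darkorange3", "darkturquoise")) {
  # Same panels as lower_function() and diag_function()
  lower_panel <- function(data, mapping) {
    ggplot(data = data, mapping = mapping) +
      geom_point(alpha = 0.5) +
      geom_smooth(method = "lm", color = "red")
  }
  diag_panel <- function(data, mapping) {
    ggplot(data = data, mapping = mapping) +
      geom_density()
  }

  ggpairs(
    data,
    # Turn the column name (a string) into a symbol for aes()
    mapping = aes(color = !!rlang::sym(genre)),
    columns = c("budget", "length", "rating", "votes", "year"),
    upper   = list(continuous = wrap("cor", size = 3.5)),
    lower   = list(continuous = lower_panel),
    diag    = list(continuous = diag_panel)
  ) +
    scale_color_manual(values = colors) +
    theme_bw()
}

# Usage (assumes the data frame from the previous chunk):
# movies_without_na_drama %>% pairs_by_genre("Drama")
```

Calling it with `"Drama"` should give the same matrix as before. For another genre, pass the name of a column that has been recoded to "No"/"Yes" in the same way.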

There are many other ways to customize a pairwise plot matrix. For the rest, see the `ggpairs()` [documentation](https://ggobi.github.io/ggally/articles/ggpairs.html).

### Automated EDA app {#sec-automated-eda-app}

Remember how I warned you about automating the EDA process using an app? We've spent some time learning how to perform EDA manually. Now is the time to explore some of the available automation tools: **ExPanDaR** [@gassen2020], **ggquickeda** [@mouksassi2024], and **radiant** [@nijs2024].

Let's start with the similarities. All three:

-   Use **shiny** [@chang2024] to provide interactive web-based interfaces.
-   Cut the coding required. In most cases, you load your data and call a single function to start the app.
-   Enable interactive exploration. As you adjust inputs in the UI (select variables, filters, plot types, etc.), the output updates immediately in your browser.
-   Support exporting your results in some form.

They differ in scope and in the number of features they offer, which in turn shapes their target audiences. Let's take a closer look at each of them.

We'll once again use **movies_na_length_under_500** as the data set.

#### ExPanDaR {#sec-expandar}

ExPanDaR is especially good for panel data (data with both cross-sectional and time-series dimensions). You can examine summary statistics, visualize distributions and relationships, and perform linear regression analysis. ExPanDaR even allows for simple experimental tests to verify robustness, such as splitting the data into training and test samples.

A key feature is the ability to share analyses without revealing the raw data. This lets researchers assess the robustness of empirical results within the app, or export an R Notebook of the analysis for reproducibility.

For these reasons, ExPanDaR’s target users are academic researchers.

To start the ExPanD Shiny app:

```{r}
#| eval: false

library(ExPanDaR) 

movies_na_length_under_500 %>%
  ExPanD()
```

Here's an example screenshot of the **Scatter Plot** section of the app. You choose the variables with the dropdown menus. There can be extra choices to make: in this case, whether to use a smaller sample and whether to display a smoother like the one in @sec-pairwise-plot-matrix.

![Figure 2.1: A similar scatter plot to the one we created with DataExplorer in @sec-numerical-columns](img/figure-2-1.png){.lightbox}

For a full working example of the app in action, visit [Exploring the Accrual Landscape by Open Science](https://jgassen.shinyapps.io/expacc/).
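
The movies data has no panel structure, so here is a sketch of what a launch might look like for data that does. The `firm_panel` data frame is made up for illustration. `cs_id` and `ts_id` name the cross-sectional and time-series identifiers, and `export_nb_option = TRUE` is, as far as I can tell from the package documentation, the setting that enables the notebook export mentioned earlier.

```{r}
#| eval: false

# Hypothetical toy panel: three firms observed over four years
firm_panel <- data.frame(
  firm  = rep(c("A", "B", "C"), each = 4),
  year  = rep(2020:2023, times = 3),
  sales = c(10, 12, 15, 14, 20, 19, 23, 26, 5, 7, 8, 11)
)

# Launch only in an interactive session, since ExPanD() opens a Shiny app
if (interactive()) {
  ExPanDaR::ExPanD(
    df    = firm_panel,
    cs_id = "firm",  # cross-sectional identifier
    ts_id = "year",  # time-series identifier
    export_nb_option = TRUE
  )
}
```

Each combination of `firm` and `year` appears only once, which is what makes the data a panel.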

#### ggquickeda {#sec-ggquickeda}

ggquickeda enables rapid exploration of your data set by generating a range of plots (e.g., scatter plots, box plots, bar charts, histograms, and density plots) and summary tables.

The app also generates descriptive statistics tables for one or many variables. A notable feature is support for visualizing survival analysis. ggquickeda includes custom geoms and stats (e.g., `geom_km()`) to plot Kaplan-Meier survival curves and confidence bands. This makes it especially handy for clinical or time-to-event data.

The target audience for ggquickeda is analysts and R users who want rapid insight into their data through visualization, but prefer a graphical user interface (GUI) approach. ggquickeda includes an **Export Plot Code** feature that reveals the underlying ggplot2 code for any graph you create. This makes it easy to customize the plots further outside the app.

```{r}
#| eval: false

library(ggquickeda)

movies_na_length_under_500 %>%
  run_ggquickeda()
```

Here's an example screenshot of the **X/Y Plot** section of the app. There are more choices to make compared to ExPanDaR.

![Figure 2.2: Another similar scatter plot to the one we created with DataExplorer in @sec-numerical-columns](img/figure-2-2.png){.lightbox}

For a more comprehensive introduction to the features, there's an [hour-long webinar from Certara R School](https://certara.github.io/R-Certara/articles/lesson_3.html).

#### radiant {#sec-radiant}

Radiant isn't only an EDA tool but a comprehensive Shiny application that covers the entire analytics workflow, from data import and cleaning to exploration, modeling, and even decision-making. For the scope of this book, we'll concentrate on the EDA part.

Radiant combines several modular packages (**radiant.data**, **radiant.design**, **radiant.basics**, **radiant.model**, **radiant.multivariate**) into one unified app.

Radiant is for business analysts, students, and instructors in business and statistics. It’s used in classrooms, allowing learners to focus on concepts rather than syntax.

Radiant can generate an R Markdown report of all your analysis steps or let you save the state of the app. The state file is a snapshot of all inputs and selections, which you or others can reload later to reproduce the exact results.

```{r}
#| eval: false

library(radiant)

# Note that you can't provide radiant with the data set in the code. 
# We'll do it in the user interface instead.
radiant()
```

Since we can't provide radiant with the data set in the code, we'll do it inside the GUI:

![Figure 2.3: You can load data from any data frame in your Global Environment](img/figure-2-3.png){.lightbox}

Here's an example screenshot of the **Visualize** section of the app. The app design is closer to ggquickeda than ExPanDaR.

![Figure 2.4: Yet another similar scatter plot to the one we created with DataExplorer in @sec-numerical-columns](img/figure-2-4.png){.lightbox}

For a more comprehensive introduction to the features, check the [Radiant Tutorial Video Series](https://radiant-rstats.github.io/docs/radiant-tutorial-series.html).

#### Comparative summary {#sec-comparative-summary}

In conclusion, each package excels in a different niche. They all leverage the power of R through an interactive interface, so the *best* choice depends on the use case. Choose the one whose design philosophy matches your analysis needs.

| Package | Focus & Scope | Unique Strengths | Ideal Use Cases |
|------------------|------------------|------------------|-------------------|
| **ExPanDaR** | Particularly for panel data (longitudinal data sets) | Robust modeling features, reproducible reports | Academic research, teaching, panel data analysis |
| **ggquickeda** | Quick visualization and summary statistics | Quick plots, Kaplan-Meier survival curves | Quick EDA, initial data inspections, simple visualization |
| **radiant** | Comprehensive business analytics platform | Broad analytics toolkit, decision analysis | Business analytics, education, end-to-end analysis workflow |