parallel coordinate

contents

introduction
prerequisites
explore
exercises
references

introduction

“With a scatterplot, the x and y axes are perpendicular to one another. In a parallel coordinate plot all axes are parallel to one another. Each variable has its own individual vertical axis (or alternatively all the axes are horizontal) and the axis is usually scaled from the minimum to the maximum case values for the variable, so that the full range of each axis is used. The values of each case [each observation] are joined by lines, so that a polygonal line across all axes defines a case [an observation].” (Unwin, 2015)

Data characteristics

  • Any number of quantitative variables
  • One or more categorical variables (optional)
  • A key variable if data are not coordinatized

Graph characteristics

  • Axes are parallel
  • One axis per variable, usually scaled from the minimum to the maximum
  • One line per observation, running from point to point on the parallel axes
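As a rough sketch of these characteristics, the base R code below rescales each variable of a small made-up data frame from its minimum to its maximum, then draws one line per observation across parallel axes. The data frame `toy` and the helper `rescale01()` are hypothetical names introduced only for illustration.

```r
# Made-up data: four observations, three quantitative variables
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(10, 40, 20, 30),
  c = c(0.5, 0.1, 0.9, 0.3)
)

# Hypothetical helper: rescale a variable so its minimum is 0 and its maximum is 1
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

scaled <- as.data.frame(lapply(toy, rescale01))

# matplot() draws one line per matrix column, so transpose:
# rows become variables (axis positions), columns become observations
matplot(t(as.matrix(scaled)), type = "l", lty = 1,
        xaxt = "n", xlab = "", ylab = "scaled value")
axis(1, at = seq_along(scaled), labels = names(scaled))
```

Each polygonal line is one row of `toy`, and every axis uses its full range because of the min-max rescaling.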

D6 Multivariate data and graph requirements



prerequisites

Project setup

  • Start every work session by launching the RStudio Project file for the course, e.g., portfolio.Rproj
  • Ensure your project directory structure satisfies the course requirements

Ensure you have installed the following packages. See install packages for instructions if needed.

  • tidyverse: The ‘tidyverse’ is a set of packages that work in harmony because they share common data representations and ‘API’ design. This package is designed to make it easy to install and load multiple ‘tidyverse’ packages in a single step. Learn more about the ‘tidyverse’ at https://tidyverse.org.
  • graphclassmate: An R package with companion materials for a course in data visualization. The package provides data sets structured for a variety of graph types plus a ggplot2 theme.
  • GGally: The R package ‘ggplot2’ is a plotting system based on the grammar of graphics. ‘GGally’ extends ‘ggplot2’ by adding several functions to reduce the complexity of combining geometric objects with transformed data. Some of these functions include a pairwise plot matrix, a two group pairwise plot matrix, a parallel coordinates plot, a survival plot, and several functions to plot networks.
  • scagnostics: Calculates graph theoretic scagnostics. Scagnostics describe various measures of interest for pairs of variables, based on their appearance on a scatterplot. They are a useful tool for discovering interesting or unusual scatterplots from a scatterplot matrix, without having to look at every individual plot.
  • gclus: Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level.
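One possible way to check which of the packages listed above are still missing, sketched in base R. The variable names `pkgs` and `not_installed` are illustrative. The sketch only reports what is missing and does not install anything.

```r
# Packages named in the list above
pkgs <- c("tidyverse", "graphclassmate", "GGally", "scagnostics", "gclus")

# Compare against the packages found in the current library paths
is_installed  <- pkgs %in% rownames(installed.packages())
not_installed <- pkgs[!is_installed]

if (length(not_installed) > 0) {
  message("Not yet installed: ", paste(not_installed, collapse = ", "))
} else {
  message("All listed packages are installed.")
}
```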

Scripts to initialize

explore/     0801-parallel-coord-explore.R  

Start the file with a minimal header:

# your name
# date

# load packages
library("tidyverse")
library("graphclassmate")
library("GGally")

Duplicate the lines of code in the session one chunk at a time. Save, Source, and compare your results to the results shown.

http://ggobi.github.io/ggally/rd.html#ggparcoord



explore

Open the explore script you initialized earlier. Load the package that has the data. We're going to use the same Swiss banknote data we used in the scatterplot matrix practice.

library("gclus")
data(bank, package = "gclus")

bank <- as_tibble(bank) %>% 
  glimpse()
#> Observations: 200
#> Variables: 7
#> $ Status   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
#> $ Length   <dbl> 214.8, 214.6, 214.8, 214.8, 215.0, 215.7, 215.5, 214....
#> $ Left     <dbl> 131.0, 129.7, 129.7, 129.7, 129.6, 130.8, 129.5, 129....
#> $ Right    <dbl> 131.1, 129.7, 129.7, 129.6, 129.7, 130.5, 129.7, 129....
#> $ Bottom   <dbl> 9.0, 8.1, 8.7, 7.5, 10.4, 9.0, 7.9, 7.2, 8.2, 9.2, 7....
#> $ Top      <dbl> 9.7, 9.5, 9.6, 10.4, 7.7, 10.1, 9.6, 10.7, 11.0, 10.0...
#> $ Diagonal <dbl> 141.0, 141.7, 142.2, 142.0, 141.8, 141.4, 141.6, 141....

The basic parallel coordinate plot using GGally::ggparcoord()

ggparcoord(bank, columns = 2:7)

my_color <- c(rcb("dark_BG"),  rcb("dark_Br"))
my_fill  <- c(rcb("light_BG"), rcb("light_Br"))
my_title <- "Comparing Swiss banknote dimensions (mm)"

To use the banknote status (genuine or counterfeit) as a conditioning variable, we have to convert it to a factor.

bank <- bank %>%
  mutate(Status = factor(Status, labels = c("genuine", "counterfeit"))) %>% 
  drop_na() %>% 
  glimpse()
#> Observations: 200
#> Variables: 7
#> $ Status   <fct> genuine, genuine, genuine, genuine, genuine, genuine,...
#> $ Length   <dbl> 214.8, 214.6, 214.8, 214.8, 215.0, 215.7, 215.5, 214....
#> $ Left     <dbl> 131.0, 129.7, 129.7, 129.7, 129.6, 130.8, 129.5, 129....
#> $ Right    <dbl> 131.1, 129.7, 129.7, 129.6, 129.7, 130.5, 129.7, 129....
#> $ Bottom   <dbl> 9.0, 8.1, 8.7, 7.5, 10.4, 9.0, 7.9, 7.2, 8.2, 9.2, 7....
#> $ Top      <dbl> 9.7, 9.5, 9.6, 10.4, 7.7, 10.1, 9.6, 10.7, 11.0, 10.0...
#> $ Diagonal <dbl> 141.0, 141.7, 142.2, 142.0, 141.8, 141.4, 141.6, 141....
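The conversion above relies on how `factor()` pairs labels with levels: the distinct values are sorted, and the labels are assigned in that order. A minimal base R sketch with a made-up vector `status` (not the banknote data), assuming, as the labels in the chunk above imply, that 0 codes genuine and 1 codes counterfeit:

```r
# Made-up 0/1 codes
status <- c(0, 1, 1, 0)

# Levels are the sorted distinct values (0, then 1), so the first label
# goes with 0 and the second with 1
status_f <- factor(status, labels = c("genuine", "counterfeit"))

levels(status_f)
as.character(status_f)
table(status_f)
```

If the codes were stored in the opposite sense, the labels would have to be swapped, so it is worth checking the data documentation before labeling.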

The possible values for the scale argument are

std           subtract mean and divide by standard deviation
robust        subtract median and divide by median absolute deviation
uniminmax     minimum of the variable is zero, and the maximum is one
globalminmax  no scaling
center        uniminmax then center each variable at scaleSummary
centerObs     uniminmax then center each variable at centerObsID
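To make the first three options concrete, here is a base R sketch that applies each description to one made-up variable. The function names `scale_std()`, `scale_robust()`, and `scale_uniminmax()` are hypothetical, and the code follows the one-line descriptions above rather than the internals of `ggparcoord()`, which may differ in detail (for example, in how the median absolute deviation is scaled).

```r
# Made-up measurements with one clear outlier (30)
x <- c(2, 4, 4, 5, 7, 30)

# "std": subtract the mean and divide by the standard deviation
scale_std <- function(x) (x - mean(x)) / sd(x)

# "robust": subtract the median and divide by the median absolute deviation
# (stats::mad() multiplies by the constant 1.4826 by default)
scale_robust <- function(x) (x - median(x)) / mad(x)

# "uniminmax": the minimum maps to zero and the maximum maps to one
scale_uniminmax <- function(x) (x - min(x)) / (max(x) - min(x))

scaled <- data.frame(
  raw       = x,
  std       = scale_std(x),
  robust    = scale_robust(x),
  uniminmax = scale_uniminmax(x)
)
print(round(scaled, 2))
```

In this toy example the outlier inflates the standard deviation, so the remaining values are squeezed together under `std`, while under `robust` they stay spread out and the outlier sits far away from them.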

Using scale = "robust" shows which measurements have outliers while reducing the effect of those outliers on the scaling.

library("scagnostics")
ggparcoord(data = bank, columns = 2:7, groupColumn  = "Status",
           scale        = "robust", 
           # scaleSummary = "median", # use with scale == "center"
           # centerObsID  = 1,        # use with scale == "centerObs"
           missing      = "exclude", 
           order        = "Skewed", # scagnostic measures 
           alphaLines   = 0.4, 
           mapping      = NULL, 
           title        = my_title) +
  labs(x = "", y = "")

The order of the quantitative variables along the x-axis affects which patterns we can discern. The order argument of ggparcoord() accepts the names of the following measures from the scagnostics package:

Outlying 
Skewed 
Clumpy 
Sparse 
Striated 
Convex 
Skinny 
Stringy 
Monotonic

The plot above orders the variables by Skewed, chosen because of the shape of the distributions of the variables. From the scagnostics, the three top choices for the ordering are Skinny, Stringy, or Skewed.

# reshape to long form: one row per banknote per measurement
bank2 <- bank %>% 
  gather(type, value, Length:Diagonal)

ggplot(bank2, aes(value, color = Status)) +
  geom_density() +
  facet_wrap(vars(type), scales = "free_x", as.table = FALSE)
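The reshaping step in the chunk above stacks the measurement columns into a key column and a value column so that the densities can be faceted by variable. A base R sketch of the same idea on a tiny made-up data frame; `wide`, `long`, and `measure_cols` are illustrative names and the values are not banknote data:

```r
# Made-up wide data: one row per observation, one column per measurement
wide <- data.frame(
  Status = c("a", "b"),
  Length = c(1, 2),
  Left   = c(10, 20)
)

measure_cols <- c("Length", "Left")

# Long form: repeat the identifying column once per measurement column,
# record which column each value came from, and stack the values
long <- data.frame(
  Status = rep(wide$Status, times = length(measure_cols)),
  type   = rep(measure_cols, each = nrow(wide)),
  value  = unlist(wide[measure_cols], use.names = FALSE)
)
print(long)
```

The long form has one row per observation per measurement, which is the shape `facet_wrap()` needs to draw one panel per variable.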

library("scagnostics")
scag <- scagnostics(bank)

str(scag)
#>  'scagnostics' num [1:9, 1:21] 0.2302 0.9215 0.9559 0.0824 0.7097 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : chr [1:9] "Outlying" "Skewed" "Clumpy" "Sparse" ...
#>   ..$ : chr [1:21] "Status * Length" "Status * Left" "Length * Left" "Status * Right" ...

attributes(scag)
#> $dim
#> [1]  9 21
#> 
#> $dimnames
#> $dimnames[[1]]
#> [1] "Outlying"  "Skewed"    "Clumpy"    "Sparse"    "Striated"  "Convex"   
#> [7] "Skinny"    "Stringy"   "Monotonic"
#> 
#> $dimnames[[2]]
#>  [1] "Status * Length"   "Status * Left"     "Length * Left"    
#>  [4] "Status * Right"    "Length * Right"    "Left * Right"     
#>  [7] "Status * Bottom"   "Length * Bottom"   "Left * Bottom"    
#> [10] "Right * Bottom"    "Status * Top"      "Length * Top"     
#> [13] "Left * Top"        "Right * Top"       "Bottom * Top"     
#> [16] "Status * Diagonal" "Length * Diagonal" "Left * Diagonal"  
#> [19] "Right * Diagonal"  "Bottom * Diagonal" "Top * Diagonal"   
#> 
#> 
#> $class
#> [1] "scagnostics"

names(scag)
#> NULL
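names(scag) returns NULL because the result is a matrix, which carries dimnames rather than names: measures label the rows and variable pairs label the columns. A minimal sketch of how one might pull out one measure and rank the pairs by it, using a small made-up matrix `toy_scag` with the same layout (hypothetical values, not results for the banknote data) so that it runs without rJava:

```r
# Made-up matrix laid out like the scagnostics result:
# one row per measure, one column per pair of variables
toy_scag <- matrix(
  c(0.2, 0.9, 0.1,
    0.5, 0.3, 0.7),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Skewed", "Skinny"),
    c("Length * Left", "Length * Right", "Left * Right")
  )
)

names(toy_scag)     # NULL: a matrix has dimnames, not names
rownames(toy_scag)  # the measures
colnames(toy_scag)  # the variable pairs

# Index one measure by its row name, then rank the pairs
ranked <- sort(toy_scag["Skewed", ], decreasing = TRUE)
print(ranked)
```

The same row-name indexing should apply to `scag`, for example `scag["Skewed", ]`, since its attributes show the same dimnames layout.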



java error

If your data include a character variable, you may get a Java error like this:

Error in newArray(jobj = x, ...) :
o does not refer to a java array

This problem is not solved yet. It is related to rJava and scagnostics, but it is not yet clear how.

exercises

references

Unwin A (2015) Investigating multivariate continuous data. Graphical Data Analysis with R. CRC Press, Taylor & Francis, Boca Raton, FL, 99–130 http://www.gradaanwr.net/

Wickham H and Grolemund G (2017) R for Data Science. O’Reilly Media, Inc., Sebastopol, CA https://r4ds.had.co.nz/

