---
title: "Descriptive Statistics"
author: Daphne Janelyn L. Go^[De La Salle University, Manila, Philippines, [email protected]]
output: html_notebook
---
Descriptive statistics are measures that summarize important features of data, often with a single number. Producing descriptive statistics is a common first step to take after cleaning and preparing a data set for analysis.
```{r}
dataset <- read.delim("phages.tsv")
```
**Mean, Median, Mode, and Range**
```{r}
mean(dataset$molGC....)
```
```{r}
# colMeans() gets the means for all columns in a data frame
# colMeans(dataset) # generates an error because not all columns have continuous data
colMeans(mtcars)
```
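The error noted in the comment above occurs because colMeans() requires every column to be numeric. One way around it, assuming you only want the means of the numeric columns, is to filter the columns with sapply() and is.numeric() first. The data frame `toy` and its columns are made up for illustration and are not part of phages.tsv.

```{r}
# Hypothetical data frame mixing text and numeric columns
toy <- data.frame(
  name = c("a", "b", "c"),
  length = c(10, 20, 30),
  gc = c(40.5, 50.5, 60.5)
)
# Flag the numeric columns, then take column means of those only
numeric_cols <- sapply(toy, is.numeric)
toy_means <- colMeans(toy[, numeric_cols, drop = FALSE])
toy_means
```

The `drop = FALSE` keeps the result a data frame even when only one numeric column remains, which colMeans() needs.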
```{r}
# rowMeans() gets the means for all rows in a data frame
rowMeans(mtcars)
```
```{r}
# Get only first 6 rows
head(rowMeans(mtcars))
```
```{r}
median(dataset$molGC....)
```
```{r}
colMedians <- apply(mtcars,
  MARGIN = 2, # Operate on columns
  FUN = median # Use function median
)
colMedians
```
```
```{r}
range(dataset$molGC....)
```
```{r}
max(dataset$molGC....)
```
```{r}
min(dataset$molGC....)
```
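The heading above also lists the mode, but base R has no built-in function for the statistical mode (mode() reports an object's storage type instead). A minimal sketch, using a hypothetical helper named stat_mode() that tabulates the values and returns the most frequent ones. It is shown on mtcars$cyl because a mode is most meaningful for discrete data.

```{r}
# stat_mode() is a hypothetical helper, not a base R function
stat_mode <- function(x) {
  counts <- table(x) # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)] # all values tied for most frequent
  if (is.numeric(x)) as.numeric(modes) else modes
}
stat_mode(mtcars$cyl)
```

Returning every tied value, rather than only the first, avoids silently hiding a bimodal variable.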
**Variance and standard deviation**
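The variance measures how far values spread around the mean: it is the average of the squared deviations (differences) from the mean. The built-in function var() computes the sample variance, which divides the sum of squared deviations by n - 1 rather than n.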
```{r}
var(dataset$molGC....)
```
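To make the definition concrete, the sketch below computes the variance by hand on the built-in mtcars data and compares it with var(). The n - 1 divisor is what makes the hand computation match; dividing by n gives a slightly smaller value.

```{r}
x <- mtcars$mpg
n <- length(x)
squared_dev <- (x - mean(x))^2 # squared deviations from the mean
sample_var <- sum(squared_dev) / (n - 1) # what var() computes
pop_var <- sum(squared_dev) / n # plain average of squared deviations
all.equal(sample_var, var(x))
pop_var < sample_var
```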
The standard deviation is the square root of the variance. Use sd() to check the standard deviation.
```{r}
sd(dataset$molGC....)
```
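The relationship can be confirmed directly. A short sketch on the built-in mtcars data; because the standard deviation is in the same units as the data (here miles per gallon), it is usually easier to interpret than the variance.

```{r}
x <- mtcars$mpg
sd_from_var <- sqrt(var(x)) # square root of the sample variance
all.equal(sd_from_var, sd(x))
```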
**Quartiles and Interquartile Ranges**
Quartiles divide a dataset into four equal parts. The first quartile (Q1) is the value below which 25% of the data falls, the second quartile (Q2) is the median, and the third quartile (Q3) is the value below which 75% of the data falls.
The interquartile range is the range between the first quartile (Q1) and the third quartile (Q3). It represents the spread of the middle 50% of the data.
```{r}
# Compute the quartiles
q <- quantile(dataset$molGC....) # No probs given: returns 0%, 25%, 50%, 75%, 100%
q1 <- quantile(dataset$molGC...., 0.25)
q2 <- quantile(dataset$molGC...., 0.50) # Median
q3 <- quantile(dataset$molGC...., 0.75)
print(q)
print(q1)
print(q2)
print(q3)
```
```{r}
# Get five number summary
fivenum(dataset$molGC....)
```
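One caveat worth knowing: fivenum() reports Tukey's hinges, while quantile() by default interpolates between data points (its type = 7 method), so the two can disagree slightly on Q1 and Q3, especially in small samples. A small sketch on a made-up vector chosen so that the difference shows:

```{r}
x <- 1:8 # made-up data, not from phages.tsv
hinges <- fivenum(x)[c(2, 4)] # lower and upper hinge
quartiles <- unname(quantile(x, c(0.25, 0.75))) # default (type 7) quartiles
hinges
quartiles
```

For large data sets the two methods usually agree closely, so the choice rarely matters beyond reporting which one was used.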
```{r}
# summary() shows the five number summary plus the mean
summary(dataset$molGC....)
```
The quantile() function also lets you check percentiles other than the common ones that make up the five number summary. To find them, pass a vector of probabilities between 0 and 1 to the probs argument.
```{r}
quantile(dataset$molGC....,
probs = c(0.1, 0.9)
) # get the 10th and 90th percentiles
```
The interquartile range (IQR) is another common measure of spread. It is the distance between the 3rd quartile and the 1st quartile, which encompasses the middle half of the data. R has a built-in IQR() function.
```{r}
IQR(dataset$molGC....)
```
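Equivalently, IQR() is the difference between the 75th and 25th percentiles returned by quantile(). A quick sketch on the built-in mtcars data:

```{r}
x <- mtcars$mpg
q <- quantile(x, c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1]) # Q3 minus Q1
all.equal(iqr_by_hand, IQR(x))
```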
A boxplot is a visual representation of the five number summary and the IQR.
```{r}
five_num <- fivenum(dataset$molGC....)
boxplot(dataset$molGC....)
# The box is drawn at x = 1; each label sits at the height of its statistic
text(x = 1, y = five_num[1], adj = 2, labels = "Minimum")
text(x = 1, y = five_num[2], adj = 2.3, labels = "1st Quartile")
text(x = 1, y = five_num[3], adj = 3, labels = "Median")
text(x = 1, y = five_num[4], adj = 2.3, labels = "3rd Quartile")
text(x = 1, y = five_num[5], adj = 2, labels = "Maximum")
text(x = 1, y = five_num[3], adj = c(0.5, -8), labels = "IQR", srt = 90, cex = 2)
```