---
title: "Descriptive Statistics"
author: Daphne Janelyn L. Go^[De La Salle University, Manila, Philippines, daphne_janelyn_go@dlsu.edu.ph]
output: html_notebook
---

Descriptive statistics are measures that summarize important features of data, often with a single number. Producing descriptive statistics is a common first step to take after cleaning and preparing a data set for analysis.

```{r}
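# read.delim() reads tab-separated files. With the default check.names = TRUE it
# replaces spaces and symbols in column names with dots, which would explain
# the molGC.... column name used below.
dataset <- read.delim("phages.tsv")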
```

**Mean, Median, Mode, and Range**

```{r}
mean(dataset$molGC....)
```
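
If a column contains missing values, mean() returns NA unless they are dropped. A minimal sketch, using a small made-up vector rather than the phage data; the same na.rm argument is also accepted by median(), var(), sd(), range(), and quantile().

```{r}
# Made-up values for illustration (not taken from phages.tsv)
x <- c(2, 4, NA, 10)

mean(x)                # NA, because one value is missing
mean(x, na.rm = TRUE)  # mean of the three observed values
```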

```{r}
# colMeans() gets the means for all columns in a data frame
# colMeans(dataset) # fails with an error because not all columns are numeric

colMeans(mtcars)
```
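
The commented-out call above fails because colMeans() requires every column to be numeric. One possible workaround, sketched here on the built-in iris data (which has one factor column) rather than on the phage table, is to keep only the numeric columns first. The names is_num and numeric_means are illustrative.

```{r}
# iris has four numeric columns and one factor column (Species)
is_num <- sapply(iris, is.numeric)

# Keep only the numeric columns, then average each one
numeric_means <- colMeans(iris[is_num])
numeric_means
```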

```{r}
# rowMeans() gets the means for all rows in a data frame
rowMeans(mtcars)
```

```{r}
# Get only first 6 rows
head(rowMeans(mtcars))
```

```{r}
median(dataset$molGC....)
```

```{r}
colMedians <- apply(mtcars,
  MARGIN = 2,   # Operate on columns
  FUN = median  # Apply median() to each column
)
colMedians
```
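
apply() converts the data frame to a matrix before working on it, which is fine for an all-numeric table such as mtcars. As an alternative sketch, sapply() visits the columns one at a time and gives the same medians here; col_medians_sapply is an illustrative name.

```{r}
# Median of each column, computed column by column
col_medians_sapply <- sapply(mtcars, median)

# Compare with the apply() approach
all.equal(col_medians_sapply, apply(mtcars, MARGIN = 2, FUN = median))
```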

```{r}
range(dataset$molGC....)
```
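
range() returns the minimum and maximum as a two-element vector, not a single number. If the width of the range is wanted, diff() can be applied to the result. A short sketch on the built-in mtcars data:

```{r}
x <- mtcars$mpg
r <- range(x)

identical(r, c(min(x), max(x)))  # range() is just c(min, max)
diff(r)                          # width of the range: max minus min
```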

```{r}
max(dataset$molGC....)
```

```{r}
min(dataset$molGC....)
```
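
The heading of this section lists the mode, but base R has no function for the statistical mode (the built-in mode() reports how an object is stored, not its most frequent value). A minimal sketch of one way to compute it, assuming a hypothetical helper named stat_mode that returns every value tied for the highest count:

```{r}
# Hypothetical helper (not part of base R): most frequent value(s) of a vector
stat_mode <- function(x) {
  ux <- unique(x)                   # distinct values, in order of appearance
  counts <- tabulate(match(x, ux))  # how often each distinct value occurs
  ux[counts == max(counts)]         # keep all values tied for the top count
}

stat_mode(mtcars$cyl)                  # most common number of cylinders
stat_mode(c("a", "b", "b", "c", "c"))  # ties return more than one value
```

For a continuous measurement most values occur only once, so the mode is usually more informative for discrete or categorical columns.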

**Variance and Standard Deviation**

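The variance measures how far values spread around the mean: it is the average of the squared deviations (differences) from the mean. The built-in function var() computes the sample variance, which divides the sum of squared deviations by n - 1 rather than n.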

```{r}
var(dataset$molGC....)
```
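
To make the definition concrete, the sketch below recomputes the variance by hand on the built-in mtcars data. It shows that var() matches the n - 1 (sample) version, which is slightly larger than the plain average of squared deviations. The variable names are illustrative.

```{r}
x <- mtcars$mpg
n <- length(x)
squared_dev <- (x - mean(x))^2

sample_var <- sum(squared_dev) / (n - 1)  # what var() computes
pop_var <- sum(squared_dev) / n           # plain average of squared deviations

all.equal(var(x), sample_var)
c(sample = sample_var, population = pop_var)
```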

The standard deviation is the square root of the variance. Use sd() to check the standard deviation.

```{r}
sd(dataset$molGC....)
```
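
Because it is a square root of the variance, the standard deviation is expressed in the same units as the data. A brief sketch on mtcars confirming the relationship and counting the share of observations within one standard deviation of the mean:

```{r}
x <- mtcars$mpg
s <- sd(x)

all.equal(s, sqrt(var(x)))   # sd() is the square root of var()
mean(abs(x - mean(x)) <= s)  # share of values within one sd of the mean
```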

**Quartiles and Interquartile Ranges**

Quartiles divide a dataset into four equal parts. The first quartile (Q1) is the value below which 25% of the data falls, the second quartile (Q2) is the median, and the third quartile (Q3) is the value below which 75% of the data falls.

The interquartile range is the range between the first quartile (Q1) and the third quartile (Q3). It represents the spread of the middle 50% of the data.

```{r}
# Compute the quartiles. With no probs argument, quantile() returns
# the 0%, 25%, 50%, 75%, and 100% quantiles.
q <- quantile(dataset$molGC....)
q1 <- quantile(dataset$molGC...., 0.25)
q2 <- quantile(dataset$molGC...., 0.50) # Median
q3 <- quantile(dataset$molGC...., 0.75)

print(q)
print(q1)
print(q2)
print(q3)
```

```{r}
# Get five number summary
fivenum(dataset$molGC....)
```

```{r}
# summary() shows the five number summary plus the mean
summary(dataset$molGC....)
```
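
The quartiles reported by fivenum() and summary() do not always agree. fivenum() returns Tukey's hinges, while summary() relies on quantile(), whose default method interpolates between observations. The small made-up vector below is a sketch of a case where they differ; on large data sets the gap is usually negligible.

```{r}
x <- 1:8

fivenum(x)                          # hinges: 2.5 and 6.5
quantile(x, probs = c(0.25, 0.75))  # default method (type 7): 2.75 and 6.25
```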

The quantile() function also lets you check percentiles other than the ones that make up the five number summary. To find them, pass a vector of probabilities between 0 and 1 to the probs argument.

```{r}
# Get the 10th and 90th percentiles
quantile(dataset$molGC...., probs = c(0.1, 0.9))
```

The interquartile range (IQR) is another common measure of spread. It is the distance between the 3rd quartile and the 1st quartile, which encompasses the middle half of the data. R has a built-in IQR() function.

```{r}
IQR(dataset$molGC....)
```
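
As a quick check of the definition, the IQR can be rebuilt from the two quartiles. This sketch uses the built-in mtcars data; iqr_manual is an illustrative name.

```{r}
x <- mtcars$mpg
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)

iqr_manual <- q[2] - q[1]  # 3rd quartile minus 1st quartile
all.equal(IQR(x), iqr_manual)
```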

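A boxplot is a visual representation of the five number summary and the IQR. The box spans the 1st to the 3rd quartile, and the line inside it marks the median. By default the whiskers extend to the most extreme points within 1.5 times the IQR of the box, so they reach the minimum and maximum only when there are no outliers.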

```{r}
five_num <- fivenum(dataset$molGC....)

boxplot(dataset$molGC....)

# The single box is drawn at x = 1. Each label is placed at the height (y) of
# its statistic; for the horizontal labels, adj > 1 shifts the text to the left.
text(x = 1, y = five_num[1], adj = 2, labels = "Minimum")
text(x = 1, y = five_num[2], adj = 2.3, labels = "1st Quartile")
text(x = 1, y = five_num[3], adj = 3, labels = "Median")
text(x = 1, y = five_num[4], adj = 2.3, labels = "3rd Quartile")
text(x = 1, y = five_num[5], adj = 2, labels = "Maximum")
text(x = 1, y = five_num[3], adj = c(0.5, -8), labels = "IQR", srt = 90, cex = 2)
```
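
The numbers behind a boxplot can be inspected without drawing it by calling boxplot.stats(). The sketch below uses mtcars$hp, a built-in column chosen because it contains one unusually high value, to show that the upper whisker stops short of the maximum when a point lies more than 1.5 times the box length beyond the box.

```{r}
x <- mtcars$hp
bs <- boxplot.stats(x)

bs$stats    # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out      # points drawn individually as outliers
fivenum(x)  # true minimum and maximum, for comparison
```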