title | author | date | output |
---|---|---|---|
Data Wrangling | Ahmed Al-Hindawi & Danny Wong | Last updated: 28 March 2018 | ioslides_presentation |
- Use the dplyr package to manipulate your data
- Introduce logical operators
- The standard R methods for selecting data
- Some of our favourite (data wrangling) things
- Wrangling strings
- Wrangling dates
- Columns to rows and back again
- Install the packages using `install.packages()`: `dplyr`, `lubridate`
- Load the libraries
- Import the RCT dataset into an object called `RCT`
Type this into the console:
library(dplyr)
filter(RCT, age >= 65)
Done! It filters rows from the `RCT` data frame where `age` is `>= 65`.
- Most are obvious: `>` (greater than), `>=` (greater than or equal to), and similarly for `<` and `<=`.
- Others are less so: `!=` means "not equal to".
- Others are confusing: `==` tests for equality. Notice that there are two equals signs. This is because in R a single `=` means an assignment, where you're making something equal to something else: `x = 6` means make the variable called `x` equal to 6. If you then type `x == 8`, you are asking a question: is `x` equal to 8? Here, the answer is `FALSE`.
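To make the difference between assigning and comparing concrete, here is a small sketch you can paste into the console. The `ages` vector is made up for illustration and is not part of the RCT dataset.

```r
# Assignment: make the variable called x equal to 6
x <- 6   # x = 6 does the same thing here

# Comparison: each line asks a question and answers TRUE or FALSE
is_eight     <- x == 8   # is x equal to 8?
not_eight    <- x != 8   # is x not equal to 8?
six_or_more  <- x >= 6   # is x greater than or equal to 6?

# Comparisons work on a whole column at once, giving one answer per value
ages <- c(70, 64, 65)    # made-up ages
over_65 <- ages >= 65
print(over_65)
```

That one-answer-per-row behaviour is what `filter` relies on: it keeps the rows where the answer is `TRUE`.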
RCT_1 <- filter(RCT, age >= 65)
RCT_2 <- select(RCT_1, gender)
table(RCT_2)
table(select(filter(RCT, age >= 65), gender))
- First we filter the rows by a criterion:
  - we are interested in patients 65 and older (`>= 65`)
- We then assign that to a named object, `RCT_1`
- We then select the gender column, assign that to `RCT_2`, and create a table from that
- What we have left is a single column of Male/Female from a subset of our data, namely the patients who have a recorded age and are 65 or older
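The "have a recorded age" part is easy to miss: `filter` drops rows where the condition comes out as `NA`. A minimal sketch on a made-up stand-in for the RCT data (the `toy` data frame and its values are invented for illustration):

```r
library(dplyr)

# Made-up stand-in for RCT; note the missing age in row 3
toy <- data.frame(
  age    = c(70, 64, NA, 81),
  gender = c("F", "M", "F", "M")
)

toy_1 <- filter(toy, age >= 65)   # keeps rows 1 and 4; the NA row is dropped
toy_2 <- select(toy_1, gender)    # a data frame with just the gender column
print(table(toy_2))               # counts of each gender in the subset
```

The patient with a missing age is neither "65 and older" nor "under 65" as far as `filter` is concerned, so they do not appear in the result.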
filter(RCT, age >= 65)
RCT %>% filter(age >= 65)
# RCT_1 <- filter(RCT, age >= 65)
# RCT_2 <- select(RCT_1, gender)
# table(RCT_2)
#
# table(select(filter(RCT, age >= 65), gender))
The `%>%` operator (it comes from the magrittr package and is made available when you load dplyr) is called a pipe, and it (surprise, surprise) pipes data from one command to the next. So in plain English, the piped line above takes `RCT` and hands it to `filter`, which keeps the rows where `age` is >= 65.
Now that we have a subset of our data, we can pass it on to other functions in R:
RCT %>% filter(age >= 65) %>%
select(gender) %>% table()
This says: grab my data labelled `RCT`, filter the rows so that we only keep patients who have a recorded age and are 65 or older, then select the column called `gender`. With that column, make me a table of counts.
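If you want to convince yourself that the pipe is only a different way of writing the same thing, here is a small sketch comparing the nested and piped forms. The `toy` data frame is made up for illustration.

```r
library(dplyr)

# Made-up stand-in for RCT
toy <- data.frame(
  age    = c(70, 64, 81),
  gender = c("F", "M", "M")
)

# Nested: read from the inside out
nested <- select(filter(toy, age >= 65), gender)

# Piped: read from left to right
piped <- toy %>% filter(age >= 65) %>% select(gender)

same_result <- identical(nested, piped)
print(same_result)

# Carry on down the pipe to get the counts
counts <- piped %>% table()
print(counts)
```

Both forms run the same functions in the same order; the pipe just puts the steps in reading order.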
- In addition to `filter` and `select`, there are:
  - `arrange` --> sorts rows
  - `distinct` --> finds unique values
  - `mutate` --> creates a new column based on some parameters. Hint: you can use other column names here, useful for finding out the age of patients, length of time they were in hospital...etc
  - `recode` --> helps you recode values which might have been mislabelled
  - `group_by` --> This one is great!!
RCT %>% arrange(age)
RCT %>% arrange(desc(age))
RCT %>% select(gender) %>% distinct()
RCT %>% filter(age >= 65) %>% select(gender) %>%
mutate(gender = recode(gender, f = "F"))
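Here is a sketch that strings several of these verbs together on made-up data, including the length-of-stay hint for `mutate`. The `toy` data frame and the `admitted`, `discharged` and `los_days` columns are hypothetical; the RCT dataset may not contain anything like them.

```r
library(dplyr)

# Made-up data with one mislabelled gender ("f" instead of "F")
toy <- data.frame(
  age        = c(70, 64, 81),
  gender     = c("F", "f", "M"),
  admitted   = as.Date(c("2018-03-01", "2018-03-02", "2018-03-05")),
  discharged = as.Date(c("2018-03-04", "2018-03-03", "2018-03-12"))
)

tidy <- toy %>%
  mutate(
    gender   = recode(gender, f = "F"),              # fix the mislabelled value
    los_days = as.numeric(discharged - admitted)     # new column built from two others
  ) %>%
  arrange(desc(age))                                 # oldest patient first

print(tidy)

# Unique gender values after recoding
genders <- tidy %>% select(gender) %>% distinct()
print(genders)
```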
`group_by` breaks down a dataset into specified groups of rows. When you then apply the verbs above on the resulting object they’ll be automatically applied “by group”. Most importantly, all this is achieved by using the same exact syntax you’d use with an ungrouped object.
RCT %>% group_by(gender) %>% summarise(age_avg = mean(age))
Sadly this won't give us a useful answer because `mean` has a little hissy fit (it returns `NA`) if there are NAs in the data; fix:
RCT %>% group_by(gender) %>%
summarise(age_avg = mean(age, na.rm = TRUE))
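To see the hissy fit and the fix side by side, here is a sketch on made-up data where one age is missing. The `toy` data frame is invented for illustration.

```r
library(dplyr)

# Made-up data: one of the female patients has no recorded age
toy <- data.frame(
  age    = c(72, NA, 60, 80),
  gender = c("F", "F", "M", "M")
)

# Without the fix: any group containing an NA gets NA as its mean
without_fix <- toy %>%
  group_by(gender) %>%
  summarise(age_avg = mean(age))
print(without_fix)

# With the fix: NAs are removed before the mean is taken
with_fix <- toy %>%
  group_by(gender) %>%
  summarise(age_avg = mean(age, na.rm = TRUE))
print(with_fix)
```

Only the group with a missing value is affected, which is why this problem can go unnoticed until you look at every row of the summary.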
How do we combine all these steps to get insight into the data?
Since this RCT is looking at the effects of having a drain vs having skin infiltration on postop pain, let's see what the mean change in pain scores is at 24h vs baseline:
RCT %>% mutate(ps_change = ps24h - ps0h) %>%
group_by(random) %>%
summarise(mean_ps_change = mean(ps_change, na.rm = TRUE))
RCT$ps_change <- RCT$ps24h - RCT$ps0h
drain <- filter(RCT, random == "drain")
mean(drain$ps_change, na.rm = TRUE)
skin <- filter(RCT, random == "skin")
mean(skin$ps_change, na.rm = TRUE)
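As a sanity check, here is a sketch showing that the dplyr pipeline and a step-by-step base R version give the same numbers. The column names match the ones used above, but the `toy` data frame and its pain scores are made up, and the base R part uses `[ ]` subsetting rather than `filter`.

```r
library(dplyr)

# Made-up pain scores; one 24h score is missing
toy <- data.frame(
  random = c("drain", "drain", "skin", "skin"),
  ps0h   = c(6, 8, 7, 5),
  ps24h  = c(3, 4, 6, NA)
)

# dplyr: one pipeline, one row of output per group
by_group <- toy %>%
  mutate(ps_change = ps24h - ps0h) %>%
  group_by(random) %>%
  summarise(mean_ps_change = mean(ps_change, na.rm = TRUE))
print(by_group)

# Base R: one group at a time
toy$ps_change <- toy$ps24h - toy$ps0h
drain_mean <- mean(toy$ps_change[toy$random == "drain"], na.rm = TRUE)
skin_mean  <- mean(toy$ps_change[toy$random == "skin"], na.rm = TRUE)
print(c(drain = drain_mean, skin = skin_mean))
```

The base R version needs one extra block of code per group, whereas the `group_by` version stays the same length however many groups there are.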
- Open the course website: http://datascibc.org/Data-Science-for-Docs/
- Select Lesson: Data Wrangling
- Select Some of our favourite (data wrangling) things and do some practice