Francisco Yira Albornoz
- 19.2 When you should write a function
- 19.3 Functions are for humans and computers
- 19.4 Conditional execution
- 19.4 Function arguments
- Why is
TRUE
not a parameter torescale01()
? What would happen ifx
contained a single missing value, andna.rm
wasFALSE
?
It’s not a parameter because after loading the function is not possible
to change the value TRUE
through the arguments supplied in the
function calling.
If na.rm
was FALSE
and x
contained a single missing value, then
the output would be NA
.
- In the second variant of
rescale01()
, infinite values are left unchanged. Rewriterescale01()
so that-Inf
is mapped to0
, andInf
is mapped to1
.
x <- c(1:10, Inf, -Inf)
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE, finite = TRUE)
y <- (x - rng[1]) / (rng[2] - rng[1])
if (any(y == Inf)) y[y == Inf] <- 1
if (any(y == -Inf)) y[y == -Inf] <- 0
y
}
rescale01(x)
## [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
## [8] 0.7777778 0.8888889 1.0000000 1.0000000 0.0000000
- Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?
mean(is.na(x))
## [1] 0
prop_na <- function(x) mean(is.na(x))
x / sum(x, na.rm = TRUE)
## [1] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
prop_total <- function(x) x / sum(x, na.rm = TRUE)
prop_total2 <- function(x) {
total <- sum(x, na.rm = TRUE)
x / total
}
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
## [1] NaN
coef_variation <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
coef_variation2 <- function(x) {
sd_x <- sd(x, na.rm = TRUE)
mean_x <- mean(x, na.rm = TRUE)
sd_x / mean_x
}
- Follow http://nicercode.github.io/intro/writing-functions.html to write your own functions to compute the variance and skew of a numeric vector.
my_var <- function(x) {
mean_x <- mean(x)
n <- length(x)
sum((x - mean_x)^2) / (n - 1)
}
my_skewness <- function(x) {
mean_x <- mean(x)
var_x <- var(x)
sqr_x <- x^(1/2)
my_skew <- sqr_x * sum((x-mean_x)^3) / ((var_x)^(3/2))
}
- Write
both_na()
, a function that takes two vectors of the same length and returns the number of positions that have anNA
in both vectors.
both_na <- function(x, y) {
is_na_both <- is.na(x) & is.na(y)
sum(is_na_both)
}
a <- c(NA, 1:6, NA, 8)
b <- c(1:7, NA, 8)
both_na(a, b)
## [1] 1
- What do the following functions do? Why are they useful even though they are so short?
is_directory <- function(x) file.info(x)$isdir
is_readable <- function(x) file.access(x, 4) == 0
The first function is_directory
checks if a character vector is a
valid directory in the file system.
The second function is_readable
returns TRUE
when we have read
permissions for the file specified in the character vector allows, and
FALSE
otherwise.
They are both useful because their names communicates more clearly the intent of the code (versus using the code inside them without the function call).
- Read the complete lyrics to “Little Bunny Foo Foo”. There’s a lot of duplication in this song. Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.
The original song:
Little bunny Foo Foo Hopping through the forest Scooping up the field mice And bopping them on the head
Down came the Good Fairy, and she said “Little bunny Foo Foo I don’t want to see you Scooping up the field mice And bopping them on the head. I’ll give you three chances, And if you don’t stop, I’ll turn you into a GOON!”
foo_foo %>%
hop(through = forest) %>%
scoop(up = field_mice) %>%
bop(on = head)
good_fairy %>%
came(direction = down) %>%
want(!is_scoop(foo_foo, up = field_mice),
!is_bop(foo_foo, on = head)) %>%
give(n_chances = 3,
if_not = turn_into_goon())
- Read the source code for each of the following three functions, puzzle out what they do, and then brainstorm better names.
f1 <- function(string, prefix) {
substr(string, 1, nchar(prefix)) == prefix
}
f1("hello", "he")
## [1] TRUE
f1("precedente", "pre")
## [1] TRUE
This function tell us if a string starts with a certain character
sequence (a prefix). Better names could be starts_with
or
has_prefix
.
f2 <- function(x) {
if (length(x) <= 1) return(NULL)
x[-length(x)]
}
f2(1)
## NULL
f2(1:10)
## [1] 1 2 3 4 5 6 7 8 9
This function removes the last element of input vector. Better names
could be remove_last
, remove_tail
or minus_tail
.
f3 <- function(x, y) {
rep(y, length.out = length(x))
}
f3(1:10, 2)
## [1] 2 2 2 2 2 2 2 2 2 2
f3(1:10, 1:3)
## [1] 1 2 3 1 2 3 1 2 3 1
f3(1:10, 1:20)
## [1] 1 2 3 4 5 6 7 8 9 10
This function repeats (recycles) the input vector y
until the output
matches the length of the input vector x
. Descriptive names for this
function could be make_eq_length
, as_long_as
,
rep_until_same_length
.
- Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.
run_sql <- function(conn, path){
library(stringr)
library(DBI)
library(odbc)
query <- str_flatten(readLines(path), collapse ="\n")
as_tibble(dbGetQuery(conn, query))
}
Maybe run_query
is a better name for this function. Also, argument
connection
could be a better argument than conn
(better be explicit
than hoping than everyone will figure out what conn
stands for).
Finally , the argumentpath
could be renamed to path_to_query
to
avoid confusion about which path is required.
- Compare and contrast
rnorm()
andMASS::mvrnorm()
. How could you make them more consistent?
Both functions have 3 common arguments (number of observations to
generate, mean, and standard deviation). However, some of them are
optional in one but not in the other, and also the mean and standard
devation argument has different name in each function (mean
and sd
in rnorm()
, and mu
and Sigma
in MASS:mvrnorm()
).
An idea to make them more consistent would be change argument name
Sigma
to sd
in MASS:mvrnorm()
,
- Make a case for why
norm_r()
,norm_d()
etc would be better thanrnorm()
,dnorm()
. Make a case for the opposite.
Ussing a common prefix for related functions is a good idea because when users type the prefix in RStudio, it will trigger the auto-complete pop-up, which will help the user to find those related functions even if they don’t remember the exact name.
However, the common prefix could make more easy to miss the difference between the related functions (when writing or reading code). Maybe a longer, more descriptive suffix would solve this potential issue.
- What’s the difference between
if
andifelse()
? Carefully read the help and construct three examples that illustrate the key differences
ifelse()
is a vectorized control flow function, and allows using a
condition vector (argument test
) with length greater than 1. It will
evaluate the condition element-wise, and pick elements from other two
vectors (yes
and no
) depending on the result. The output will be a
vector with the same properties than the condition vector (same
dimension and class).
if
will only use a length 1 condition vector. However, it allows us to
use more complex conditional expressions (for example, write a file to
disk, install a package, etc).
Example 1: if we want vectorized evauation, we need to use ifelse()
.
numbers <- 1:10
if (numbers %% 2 == 0) {
"even"
} else {
"odd"
}
## Warning in if (numbers%%2 == 0) {: the condition has length > 1 and only the
## first element will be used
## [1] "odd"
ifelse(numbers %% 2 == 0, "even", "odd")
## [1] "odd" "even" "odd" "even" "odd" "even" "odd" "even" "odd" "even"
Example 2: if
allows us to execute expressions which generate
side-effects insted of returning values.
if (exists("mtcars")) saveRDS(mtcars, "mtcars.rds")
Example 3: ifelse()
allows NA
in the condition vector (they simply
become NA
in the output vector).
animals <- c(NA, "dog", "cat", "parrot")
ifelse(animals == "dog",
"love him",
"not sure")
## [1] NA "love him" "not sure" "not sure"
- Write a greeting function that says “good morning”, “good afternoon”, or “good evening”, depending on the time of day. (Hint: use a time argument that defaults to lubridate::now(). That will make it easier to test your function.)
greeting <- function(time = lubridate::now()) {
if (lubridate::hour(time) <= 12) {
"Good morning"
} else if (lubridate::hour(time) <= 18) {
"Good afternoon"
} else {
"Good evening"
}
}
greeting()
## [1] "Good afternoon"
- Implement a
fizzbuzz
function. It takes a single number as input. If the number is divisible by three, it returns “fizz”. If it’s divisible by five it returns “buzz”. If it’s divisible by three and five, it returns “fizzbuzz”. Otherwise, it returns the number. Make sure you first write working code before you create the function.
fizzbuzz <- function(x) {
if (x %% 15 == 0) {
"fizzbuzz"
} else if (x %% 3 == 0) {
"fizz"
} else if (x %% 5 == 0) {
"buzz"
} else {
x
}
}
- How could you use
cut()
to simplify this set of nested if-else statements?
if (temp <= 0) {
"freezing"
} else if (temp <= 10) {
"cold"
} else if (temp <= 20) {
"cool"
} else if (temp <= 30) {
"warm"
} else {
"hot"
}
cut(
temp,
breaks = c(-Inf, 0, 10, 20, 30, Inf),
labels = c("freezing", "cold", "cool", "warm", "hot")
)
How would you change the call to cut()
if I’d used < instead of <=?
cut(
temp,
breaks = c(-Inf, 0, 10, 20, 30, Inf),
labels = c("freezing", "cold", "cool", "warm", "hot"),
include.lowest = TRUE
)
What is the other chief advantage of cut()
for this problem? (Hint:
what happens if you have many values in temp
?)
cut()
is a vectorized function, so if temp
has length greater than
1, then we will still get a vector with the variable recoded following
our criteria. However, with the if/else solution, only the first element
of temp
will be recoded.
- What happens if you use
switch()
with numeric values?
It returns the element in the corresponding position in the list of
alternatives (i.e. a 3
wil return the third element of the list, even
if there is another one named 3
).
- What does this
switch()
call do? What happens ifx
is “e”?
switch(x,
a = ,
b = "ab",
c = ,
d = "cd"
)
Documentation says that if the matching element is missing, then the
next one is evaluated, which means that when x = "a"
, the element
b = "ab"
will be returned. If there are no more elements, a NULL
value will be returned invisibly.
- What does
commas(letters, collapse = "-")
do? Why?
It throws an error. The ...
argument in commas
passes all the
arguments (including collapse = "-"
) to ...
inside
stringr::str_c
. This causes duplication of values for the argument
collapse
inside str_c
, and then the function doesn’t know which one
to use.
commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters, collapse = "-")
- It’d be nice if you could supply multiple characters to the
pad
argument, e.g.rule("Title", pad = "-+")
. Why doesn’t this currently work? How could you fix it?
It doesn’t work because the function calculates the number of
repetitions asuming that pad
has length of 1, so if we use multiple
arguments in the argument we will get a result of more than one line due
to the extra characters.
What is needed is to obtain the length
of pad
inside the function,
and then use this number in the width
calcultation.
- What does the
trim
argument tomean()
do? When might you use it?
It allows to compute a trimmed mean by removing (trimming) a fraction
trim
of observations in both sides of the input vector. If the vector
is ordered, the elements removed are likely to be extreme values, so web
obtain a number less susceptible to sample fluctuation than the
arithmethic mean.
http://davidmlane.com/hyperstat/A11971.html
- The default value for the method argument to
cor()
isc("pearson", "kendall", "spearman")
. What does that mean? What value is used by default?
The default method is “pearson”. It works that way because inside the
function the first comparison that is made is
if (method == "pearson")
, which evaluates to [1] TRUE FALSE FALSE
using the default value, and since if
is not vectorised, it considers
onlye the first value of the resulting logical vector.