Binary file modified docs/1_Getting_Started/datasets_getting_started/01_census2016.dta
Drawing insights from data requires information to be presented in a way that is clean and well-organized. In particular, we want to make sure that:
1. Observations where data for key variables are missing are removed or stored in a different data set (e.g., `df_raw`). *Missing data* can create bias in your analysis.
2. Data set is *tidy*, i.e., each row captures only one observation and each column captures only one variable/characteristic of the observation. Data scraped and collected manually or using automation often comes in *untidy* shapes (e.g., two variables might be placed in the same column separated with a hyphen `-`).
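As an illustration of the hyphen case, a minimal tidying sketch using `tidyr::separate()` (toy data; the column names are illustrative):

```r
library(tidyr)

# one column holds two variables ("country" and "year") separated by a hyphen
untidy <- data.frame(country_year = c("Canada-2016", "Mexico-2016"))

# split it into two tidy columns
tidy <- separate(untidy, country_year, into = c("country", "year"), sep = "-")
tidy
```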

In this notebook, we teach you how to load datasets properly in R and then clean them using some common methods from the `haven` and `tidyverse` packages.

## Part 1: Introduction to data in R

Look at line `pr` in the output from `glimpse` above:

The `pr` variable in the Census data stands for province. Do these look much like Canadian provinces to you? We can see the variable type is `<dbl+lbl>`: this is a *labeled double*. Let's transform this variable type into factors.

There are three ways to change variable types into factor variables.

1. We can change a specific variable inside a dataframe to a factor by using the `as_factor` command
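A minimal sketch of this first approach, using a toy labelled vector rather than the Census variable itself:

```r
library(haven)
library(dplyr)

# a labelled double like the Census `pr` column (toy values and labels)
df <- tibble::tibble(pr = labelled(c(35, 24, 35), c(Ontario = 35, Quebec = 24)))

# convert the labelled column to a factor with `as_factor`
df <- df %>% mutate(pr = as_factor(pr))
as.character(df$pr)
```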

Here is our final dataset, all cleaned up! Notice that some of the variables (e.g., `pr`) are now factors.

### Creating new variables

Another important clean-up task is to make new variables. The best way to create a new variable is using the `mutate` command.

The `mutate` command is an efficient way of manipulating the columns of our data frame. We can use it to create new columns out of existing columns or with completely new inputs. The structure of the mutate command is as follows:
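A minimal sketch of that structure on toy data (the names are illustrative):

```r
library(dplyr)

sales <- data.frame(price = c(10, 20))

# mutate(data, new_variable = function_of(existing_columns))
sales <- sales %>% mutate(price_with_tax = price * 1.05)
sales
```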

Do you see our new variable at the bottom? Nice!

### Test your knowledge

In the following code, what is (1) the name of the new variable created, (2) the inputs used to make the new variable, and (3) the function used to transform the inputs in the values of the new variable?

(A) grade_adjusted, grade and 2, mutate
(B) mutate, grade and 2, mutate
```{r}
glimpse(census_data)
```

### Test your knowledge

Overwrite the existing `has_kids` variable with a new `has_kids` variable but with type factor.

> **Hint**: To overwrite a variable, create a new variable with the same name as the name of the variable you want to overwrite.

```{r}
# use this cell to write your code
```

```{r}
library(haven)
```

The World Development Indicators (WDI) is the primary World Bank collection of development indicators, compiled from officially recognized international sources.

We have used the World Bank's DataBank applet to select and import some macro and development-related time series data for the countries Canada, India, Mexico, South Africa, and Zimbabwe for years 2014-2022.

```{r, message = FALSE}
# importing required packages
library(tidyverse)
library(haven)
```

```{r}
head(wdi, 10)
dim(wdi)
```

The data frame is in long format. Each unique value in `Series_Name` is an entry for a row, and not a column. As an example:


| Country | Year | Var | Value |
|---------|------|-----|-------|
A simpler version of the data frame in a **wide-format** could look like this:
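A reshape between these two layouts can be sketched with `tidyr::pivot_wider()` (toy data; the values are illustrative):

```r
library(tidyr)

# long format: one row per country-year-variable combination
long <- data.frame(Country = "Canada", Year = 2014,
                   Var = c("GDP", "Pop"), Value = c(1.8, 35.4))

# wide format: one row per country-year, one column per variable
wide <- pivot_wider(long, names_from = Var, values_from = Value)
wide
```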

While `Series_Name` contains descriptions for each of the series in the data frame, `Series_Code` offers a handy way to group variables.

Our `Series_Code` variable follows a taxonomy system. For example, any code starting with `AG` belongs to a *family of series* related to the state of agriculture in that country. Let's see the unique series families and their sub-series names.

```{r}
# list each series family (the two-letter prefix of `Series_Code`)
# together with its series names (illustrative reconstruction)
Series_Families <- wdi %>%
  mutate(Series_Family = substr(Series_Code, 1, 2)) %>%
  distinct(Series_Family, Series_Code, Series_Name)
Series_Families
```

```{r}
select(access_wdi, c("Country_Code", "Series_Code", "count_na"))
```

This data frame shows that we don't have *any* data for series beginning with the `SE` (Schooling) prefix.
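A per-series NA tally of this kind can be sketched on toy data (illustrative values):

```r
library(dplyr)

# toy frame: two SE (schooling) rows with no data, one EG row with data
toy <- data.frame(Series_Code = c("SE.PRM", "SE.PRM", "EG.ELC"),
                  Value = c(NA, NA, 71.2))

# count missing values per series
na_by_series <- toy %>%
  group_by(Series_Code) %>%
  summarise(count_na = sum(is.na(Value)))
na_by_series
```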

> **Think Deeper**: how could you be systematic when choosing whether to drop or not NA values? Our panel dataset is indexed by country and series, and we're interested in yearly values over time. Think about the options we have to drop NAs: (1) Dropping a series altogether (2) Dropping specific countries (3) Dropping specific rows (i.e., country-series pairs)...

Let's create an array with `Series_Code` to be dropped from our dataset.

```{r}
filtered_access_wdi
```

Now the only variables left in this data frame are the `EG` variables, which indicate the levels of access to electricity and other power sources within the countries.

This dataset is clearly not appropriate to answer questions about overall access to institutions; however, it could be extremely useful if the scope of the research is narrowed to access to power utilities. For example, we could use the dataset to visualize the growth in access to energy across the countries over the last 5 years.

### Test your knowledge

```{r}
test_2()
```

## Part 2: Merging data frames in R

Now let's take a step back and consider an example of merging data frames. Our WDI data set has macro information on *national incomes*, *CABs*, *Bank Capital to Assets Ratios,* and various kinds of *CPIA* ratings. Let's extract that data and merge it with data from the Quarterly Public Debt (QPD) database. The QPD is exactly what you think it is: a record of sovereign debt managed by the World Bank and the International Monetary Fund (IMF).

First, let's turn to our `wdi` dataset.

The series data in QPD is stored on a quarter-by-year basis. We can aggregate these quarterly values by year.

> **Note**: R will usually return `NA` if you tell it to sum over rows/columns that include NA values. We resolve this by setting the parameter `na.rm = TRUE`, which tells R to remove the NAs before computing.
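For example:

```r
# a sum that includes an NA is NA unless na.rm = TRUE drops the NA first
sum(c(1, 2, NA))                # NA
sum(c(1, 2, NA), na.rm = TRUE)  # 3
```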

Before we aggregate the data, let's check the number of periods for which data is missing. Again, we'll look at the country-series combinations. We do that below with a loop, telling R to *manually* go over each unique row, count the number of NAs along the period columns, and then store the result in another data frame called `status`.

```{r}
status <- data.frame() # empty data frame that is placeholder for the final data
Series_Codes <- qpd$Series_Code %>% unique() # gets all `Series_Codes` to iterate over
Countries <- qpd$Country_Code %>% unique() # gets all `Country_Codes` to iterate over

for (country_code in Countries) {
  for (series_code in Series_Codes) {
    # subset the period columns for this country-series pair
    # (illustrative reconstruction: period columns assumed to be named like "2014Q1")
    cols_to_check <- qpd %>%
      filter(Country_Code == country_code, Series_Code == series_code) %>%
      select(matches("^[0-9]{4}Q[1-4]$"))

na_count <- sum(is.na(cols_to_check)) # finally, store the value of NAs
result <- data.frame(Country_Code = country_code, Series_Code = series_code, na_count = na_count)
status <- rbind(status, result) # appends the result to the status data frame and iterate over
}
}

```

```{r}
status_to_drop
```

These are the country-series pairs that we must drop from the data frame.

> **Note**: by storing our exploration's results in `status`, we can revisit our decision to drop values anytime we want. Such proper documentation builds trust around the validity of the aggregate computations!

Let's now use `anti_join()` to drop any rows from `qpd` that match the `Country_Code`, `Series_Code` pairs in `status_to_drop`.

```{r}
qpd_filtered <- anti_join(qpd, status_to_drop, by = c("Country_Code", "Series_Code"))
qpd_filtered
```

> **Note**: `anti_join()` is a function that removes rows from a data frame that have matching values in specified columns of another data frame. Here, it drops the rows from `qpd` that match the `Country_Code` and `Series_Code` pairs in `status_to_drop`, producing the filtered data frame `qpd_filtered`.
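A toy sketch of `anti_join()` (illustrative data):

```r
library(dplyr)

df   <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
drop <- data.frame(id = 2)

# rows of `df` whose `id` has a match in `drop` are removed
kept <- anti_join(df, drop, by = "id")
kept$id  # 1 3
```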

The code below tells R how to manually go over each unique combination of `Country`, `Series_Code` values and aggregate quarterly values by year. To learn exactly what each line does, head to the Appendix!

```{r}
# pivot the data from wide to long format, creating separate rows for each quarterly value
# (illustrative reconstruction: period columns assumed to be named like "2014Q1")
qpd_long <- qpd_filtered %>%
  pivot_longer(cols = matches("^[0-9]{4}Q[1-4]$"),
               names_to = "Period", values_to = "Value") %>%
  # extract the year from the period label and aggregate quarterly values by year
  mutate(Year = substr(Period, 1, 4)) %>%
  group_by(Country_Code, Series_Code, Year) %>%
  summarise(Value = sum(Value, na.rm = TRUE), .groups = "drop")
qpd_long
```

Take a look at the image below to learn about other methods for merging.

![Merging Methods - COMET Team](media/join_visual.png)

As illustrated in the image, `full_join()` and `right_join()` are great for merging data sources in situations where we are particularly interested in the issue of missing matches. For simpler cases, `inner_join()` is ideal when you only want to include fully matched observations in the final data set.
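The contrast can be sketched on toy data (illustrative names):

```r
library(dplyr)

left_df  <- data.frame(id = c(1, 2), x = c("a", "b"))
right_df <- data.frame(id = c(2, 3), y = c("c", "d"))

# inner_join keeps only fully matched observations
inner <- inner_join(left_df, right_df, by = "id")

# full_join keeps every observation, filling missing matches with NA
full <- full_join(left_df, right_df, by = "id")
```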

### Test your knowledge

Below are a few key terms that define and contextualize components of the technical environment.

> For example, when the operation $1+1$ is typed into a Jupyter notebook, the web browser (that you are viewing the notebook in) sends a request to the kernel (for this notebook, the *R* kernel is being used), which computes the request and sends the answer back to your notebook, producing the result: $2$.

One final term to consider as you begin using notebooks for econometrics is **reproducibility**. Reproducibility is a core priority in empirical economics and data science. It means ensuring the creation of analyses that can reliably reproduce the same results when analyzing the same data at a later time; reproducibility is a key component of the scientific method. Notebooks allow us to write executable code, attach multimedia and leave meaningful text annotations and discussion throughout our analysis, all of which contribute to a reproducible and transparent data workflow.

::: callout-note
### 🔎 **Let's think critically**
:::

Now try this one:

```{r}
# Replace ... with your answer
# Your answer should be a single digit

answer_2 <- ...

test_2()
```
The rule of thumb, then, is to ***always write and execute code from the start to the end of the notebook***.

### Directories

Notebooks are stored in **directories** in JupyterHub. It can be helpful to think about JupyterHub as an actual hub - that is, a place where different hub users (holding individual Jupyter accounts) can gather to share and collaborate on files. Directories store files in a similar way that a folder on our computer does. The only difference is that JupyterHub's cloud-based format allows directories to be used either individually or collaboratively. The directory browser is located on the left-hand side of the notebook interface and can be used to store other files including:

- Images
- Data Files
To export your file, go to **File** \> **Save and Export As...** \> then select your desired file format.

A few notes about writing your own Notebooks:

- Cells can be added to notebooks (when the cell is selected, indicated by a blue highlight on the left-hand side) by using the `+` arrow near the top right of the window (not the blue button). Alternatively, you can use the `a` key to add a cell above your current cell, and `b` to add a cell below your current cell.
- Cells can be deleted by right-clicking on the cell and selecting **Delete Cells**. Alternatively, you can press the `d` key twice to delete the current cell.
- A cell's status (as either a code or markdown cell) is always indicated and can be changed in the dropdown bar of the menu.
