In this chapter, you will be learn about how to use functionality from the dplyr package to wrangle your data. Data wrangling is the catchall phrase that includes the processes of cleaning, structuring, and summarizing your data. It is a skill that every educational scientist needs to have in their computational toolkit!
We will use the riverview.csv data to illustrate several data wrangling ideas. The data contain five attributes collected from a random sample of \(n=32\) employees working for the city of Riverview, a hypothetical midwestern city (see the data codebook). To begin, we will load several libraries and import the data into an object called city
.
# Load libraries
library(dplyr)
library(readr)
# Read in data
= read_csv(file = "https://raw.githubusercontent.com/zief0002/modeling/master/data/riverview.csv")
city head(city)
# A tibble: 6 × 5
education income seniority gender party
<dbl> <dbl> <dbl> <chr> <chr>
1 8 26.4 9 female Independent
2 8 37.4 7 Not female Democrat
3 10 34.2 16 female Independent
4 10 25.5 1 female Republican
5 10 47.0 14 Not female Democrat
6 12 46.5 11 female Democrat
Recall that functions work by taking arguments as inputs and then producing an output. For example, the summary()
function takes the city
data frame as its input.
summary(city)
education income seniority gender
Min. : 8 Min. :25.48 Min. : 1.00 Length:32
1st Qu.:12 1st Qu.:44.51 1st Qu.: 9.75 Class :character
Median :16 Median :55.83 Median :15.00 Mode :character
Mean :16 Mean :53.74 Mean :14.81
3rd Qu.:20 3rd Qu.:62.72 3rd Qu.:20.25
Max. :24 Max. :82.73 Max. :27.00
party
Length:32
Class :character
Mode :character
We could get the same result by using the piping operator (%>%
). This operator takes a DATA FRAME (given immediately before the operator) and uses it as the FIRST argument in the function that comes immediately after the operator.
# The pipe operator makes city the first argument of the summary function
%>% summary() city
education income seniority gender
Min. : 8 Min. :25.48 Min. : 1.00 Length:32
1st Qu.:12 1st Qu.:44.51 1st Qu.: 9.75 Class :character
Median :16 Median :55.83 Median :15.00 Mode :character
Mean :16 Mean :53.74 Mean :14.81
3rd Qu.:20 3rd Qu.:62.72 3rd Qu.:20.25
Max. :24 Max. :82.73 Max. :27.00
party
Length:32
Class :character
Mode :character
Note since the summary()
function did NOT include any additional arguments, we do not include anything between the parentheses after we pipe. Here is another example that illustrate the use of the pipe operator.
# Count number of rows in city data frame
nrow(city)
[1] 32
# Can be written using the pipe operator as...
%>% nrow() city
[1] 32
The filter()
function from the dplyr package is used to select certain rows from a data frame. For example, the syntax below selects all rows from the city
data frame that have a gender
value of female
. To do this we provide a logical statement as the second argument to the filter()
function.
# Select the rows from city that have a gender value of "female"
filter(city, gender == "female")
# A tibble: 18 × 5
education income seniority gender party
<dbl> <dbl> <dbl> <chr> <chr>
1 8 26.4 9 female Independent
2 10 34.2 16 female Independent
3 10 25.5 1 female Republican
4 12 46.5 11 female Democrat
5 12 52.5 16 female Independent
6 14 32.6 5 female Independent
7 15 37.3 8 female Democrat
8 16 38.6 11 female Independent
9 16 55.9 22 female Independent
10 16 59.5 20 female Independent
11 17 60.1 10 female Independent
12 18 54.8 20 female Republican
13 18 62.5 16 female Republican
14 19 56.0 21 female Independent
15 20 56.3 22 female Independent
16 20 54.7 12 female Independent
17 21 71.2 26 female Democrat
18 22 56.3 8 female Independent
# Can be written using the pipe operator as...
%>% filter(gender == "female") city
# A tibble: 18 × 5
education income seniority gender party
<dbl> <dbl> <dbl> <chr> <chr>
1 8 26.4 9 female Independent
2 10 34.2 16 female Independent
3 10 25.5 1 female Republican
4 12 46.5 11 female Democrat
5 12 52.5 16 female Independent
6 14 32.6 5 female Independent
7 15 37.3 8 female Democrat
8 16 38.6 11 female Independent
9 16 55.9 22 female Independent
10 16 59.5 20 female Independent
11 17 60.1 10 female Independent
12 18 54.8 20 female Republican
13 18 62.5 16 female Republican
14 19 56.0 21 female Independent
15 20 56.3 22 female Independent
16 20 54.7 12 female Independent
17 21 71.2 26 female Democrat
18 22 56.3 8 female Independent
Here, since the filter()
function included a second argument, we include that argument in the function that the data frame is piped into. What is piped into the function, in this case city
, will be automatically inputted into the FIRST argument.
Here are some common operations that researchers use to prepare data for analysis (i.e., data preparation, data wrangling, data cleaning) and the corresponding dplyr functions.
Data wrangling activity | dplyr function |
---|---|
Select a subset of rows from a data frame. |
filter()
|
Select a subset of columns from a data frame. |
select()
|
Add new columns to a data frame. |
mutate()
|
Sort and re-order data in a data frame. |
arrange()
|
Compute summaries of columns in a data frame. |
summarize()
|
Group the data to carry out computations for each group. |
group_by()
|
To select a subset of rows, we will pipe the data frame we want to select rows from into the filter()
function. The argument(s) for this function are logical expressions that will be used to select the rows.
# Select the female employees
%>%
city filter(gender == "female")
# A tibble: 18 × 5
education income seniority gender party
<dbl> <dbl> <dbl> <chr> <chr>
1 8 26.4 9 female Independent
2 10 34.2 16 female Independent
3 10 25.5 1 female Republican
4 12 46.5 11 female Democrat
5 12 52.5 16 female Independent
6 14 32.6 5 female Independent
7 15 37.3 8 female Democrat
8 16 38.6 11 female Independent
9 16 55.9 22 female Independent
10 16 59.5 20 female Independent
11 17 60.1 10 female Independent
12 18 54.8 20 female Republican
13 18 62.5 16 female Republican
14 19 56.0 21 female Independent
15 20 56.3 22 female Independent
16 20 54.7 12 female Independent
17 21 71.2 26 female Democrat
18 22 56.3 8 female Independent
Here we are selecting only the rows where the gender variable is equal to (==
) the character string “female.” Recall that a single equals sign (=
) is the assignment operator and that to say “is equal to,” we need to use two equals signs (==
).
It is a good coding practice to use multiple lines when you are piping rather than putting all the syntax on a single line. When you do this, the pipe operator (%>%
) needs to come at the end of the line. For example, in the code above, the pip operator is at the end of the first line of syntax rather than at the beginning of the second line of syntax. Include a line break after every pipe operator you use.
Note that the output from this computation (the rows of female employees) is only printed to the screen. If you want to keep the filtered data or operate on it further, you need to assign the output into an object.
# Filter the female employees
= city %>%
female_employees filter(gender == "female")
# Count the number of rows (females)
nrow(female_employees)
[1] 18
We could have found the same result exclusively using piping; without the interim assignment.
%>%
city filter(gender == "female") %>%
nrow()
[1] 18
The first pipe operator uses the city
data frame in the filter()
function to select the female employees. This output (only the female employees) is then used in the nrow()
function to count the number of rows. At is akin to a constant pipeline of chaining functions together; the output of a computation is used as the input into the next computation in the pipeline.
Here we use filter()
to select the employees that have less than a high school level of education and then summarize all of the columns using the summary()
function.
%>%
city filter(education < 12) %>%
summary()
education income seniority gender
Min. : 8.0 Min. :25.48 Min. : 1.0 Length:5
1st Qu.: 8.0 1st Qu.:26.43 1st Qu.: 7.0 Class :character
Median :10.0 Median :34.18 Median : 9.0 Mode :character
Mean : 9.2 Mean :34.11 Mean : 9.4
3rd Qu.:10.0 3rd Qu.:37.45 3rd Qu.:14.0
Max. :10.0 Max. :47.03 Max. :16.0
party
Length:5
Class :character
Mode :character
You can filter on multiple attributes by including more than one logical statement in the filter()
function. For example, the syntax below counts the number of female employees with less than a high school level of education.
%>%
city filter(gender == "female", education < 12) %>%
nrow()
[1] 3
Here, when we included multiple logical expressions in the filter()
function, separated by a comma, they were linked using the AND (&
) operator. This means that both expressions have to evaluate as TRUE
to be included. We could also have explicitly used the &
operator to link the two statements.
%>%
city filter(gender == "female", education < 12)
# Is equivalent to...
%>%
city filter(gender == "female" & education < 12)
We can also filter()
using the OR (|
) operator. This means that if EITHER expression evaluates as TRUE it is included.
%>%
city filter(gender == "female" | education < 12)
# A tibble: 20 × 5
education income seniority gender party
<dbl> <dbl> <dbl> <chr> <chr>
1 8 26.4 9 female Independent
2 8 37.4 7 Not female Democrat
3 10 34.2 16 female Independent
4 10 25.5 1 female Republican
5 10 47.0 14 Not female Democrat
6 12 46.5 11 female Democrat
7 12 52.5 16 female Independent
8 14 32.6 5 female Independent
9 15 37.3 8 female Democrat
10 16 38.6 11 female Independent
11 16 55.9 22 female Independent
12 16 59.5 20 female Independent
13 17 60.1 10 female Independent
14 18 54.8 20 female Republican
15 18 62.5 16 female Republican
16 19 56.0 21 female Independent
17 20 56.3 22 female Independent
18 20 54.7 12 female Independent
19 21 71.2 26 female Democrat
20 22 56.3 8 female Independent
This syntax would select any employee that is either female OR has an education less than 12 years.
To select a subset of columns, we will use the select()
function. The argument(s) for this function are the column names of the data frame that you want to select. For example, to select the education
, income
, and gender
columns from the city
data frame we would use the following syntax:
%>%
city select(education, income, gender)
# A tibble: 32 × 3
education income gender
<dbl> <dbl> <chr>
1 8 26.4 female
2 8 37.4 Not female
3 10 34.2 female
4 10 25.5 female
5 10 47.0 Not female
6 12 46.5 female
7 12 52.5 female
8 12 37.7 Not female
9 12 50.3 Not female
10 14 32.6 female
# … with 22 more rows
There are a number of helper functions you can use within the select()
function. For example, starts_with()
, ends_with()
, and contains()
. These let you quickly match larger blocks of columns that meet some criterion. The syntax below selects all the columns that have a column name that ends in ‘e.’
%>%
city select(ends_with("e"))
# A tibble: 32 × 1
income
<dbl>
1 26.4
2 37.4
3 34.2
4 25.5
5 47.0
6 46.5
7 52.5
8 37.7
9 50.3
10 32.6
# … with 22 more rows
You can rename a column by using the rename()
function. Here we select the education
, income
, and gender
columns from the city
data frame and then rename the education
column to educ
. Note that this works similar to assignment in that the new column name is to the left of the equal sign.
%>%
city select(education, income, gender) %>%
rename(educ = education)
# A tibble: 32 × 3
educ income gender
<dbl> <dbl> <chr>
1 8 26.4 female
2 8 37.4 Not female
3 10 34.2 female
4 10 25.5 female
5 10 47.0 Not female
6 12 46.5 female
7 12 52.5 female
8 12 37.7 Not female
9 12 50.3 Not female
10 14 32.6 female
# … with 22 more rows
To create new columns, we will use the mutate()
function. Here we create a new column called income2
based on multiplying the original income
column by 1000.
%>%
city mutate(
income2 = income * 1000
)
# A tibble: 32 × 6
education income seniority gender party income2
<dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 8 26.4 9 female Independent 26430
2 8 37.4 7 Not female Democrat 37449
3 10 34.2 16 female Independent 34182
4 10 25.5 1 female Republican 25479
5 10 47.0 14 Not female Democrat 47034
6 12 46.5 11 female Democrat 46488
7 12 52.5 16 female Independent 52480
8 12 37.7 14 Not female Democrat 37656
9 12 50.3 24 Not female Democrat 50265
10 14 32.6 5 female Independent 32631
# … with 22 more rows
Depending on your print options, the results in income2
may be in scientific notation. For example 3.74e+04
is equivalent to \(3.74 \times 10^4=37400\).
You can “turn off” scientific notation by running the following syntax in your R session:
options(scipen = 99)
You will need to run this syntax every R session.
You can create multiple new columns within the same mutate()
function. Simply include each new column as an argument. Below we again create income2
, but we also additionally create cent_educ
which computes the difference between each employee’s education level and the mean education level.
%>%
city mutate(
income2 = income * 1000,
cent_educ = education - mean(education)
)
# A tibble: 32 × 7
education income seniority gender party income2 cent_educ
<dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 8 26.4 9 female Independent 26430 -8
2 8 37.4 7 Not female Democrat 37449 -8
3 10 34.2 16 female Independent 34182 -6
4 10 25.5 1 female Republican 25479 -6
5 10 47.0 14 Not female Democrat 47034 -6
6 12 46.5 11 female Democrat 46488 -4
7 12 52.5 16 female Independent 52480 -4
8 12 37.7 14 Not female Democrat 37656 -4
9 12 50.3 24 Not female Democrat 50265 -4
10 14 32.6 5 female Independent 32631 -2
# … with 22 more rows
The arrange()
function sorts the data based on the values within one or more specified columns. The data is ordered based on the column name provided in the argument(s). The syntax below sorts the rows in the city
data frame from smallest to largest income.
%>%
city arrange(income)
# A tibble: 32 × 5
education income seniority gender party
<dbl> <dbl> <dbl> <chr> <chr>
1 10 25.5 1 female Republican
2 8 26.4 9 female Independent
3 14 32.6 5 female Independent
4 10 34.2 16 female Independent
5 15 37.3 8 female Democrat
6 8 37.4 7 Not female Democrat
7 12 37.7 14 Not female Democrat
8 16 38.6 11 female Independent
9 12 46.5 11 female Democrat
10 10 47.0 14 Not female Democrat
# … with 22 more rows
Providing the arrange()
function multiple arguments sort initially by the column name given in first argument, and then by the columns given in subsequent arguments. Here the data are sorted first by gender (alphabetically since gender
is a character string) and then by income (lowest to highest).
%>%
city arrange(gender, income)
# A tibble: 32 × 5
education income seniority gender party
<dbl> <dbl> <dbl> <chr> <chr>
1 10 25.5 1 female Republican
2 8 26.4 9 female Independent
3 14 32.6 5 female Independent
4 10 34.2 16 female Independent
5 15 37.3 8 female Democrat
6 16 38.6 11 female Independent
7 12 46.5 11 female Democrat
8 12 52.5 16 female Independent
9 20 54.7 12 female Independent
10 18 54.8 20 female Republican
# … with 22 more rows
Use the desc()
function on a column name to sort the data in descending order. Here the data are sorted first by gender (alphabetically since gender
is a character string) and then by income. Here the data are sorted first by gender (alphabetically) and then by income (highest to lowest).
%>%
city arrange(gender, desc(income))
# A tibble: 32 × 5
education income seniority gender party
<dbl> <dbl> <dbl> <chr> <chr>
1 21 71.2 26 female Democrat
2 18 62.5 16 female Republican
3 17 60.1 10 female Independent
4 16 59.5 20 female Independent
5 20 56.3 22 female Independent
6 22 56.3 8 female Independent
7 19 56.0 21 female Independent
8 16 55.9 22 female Independent
9 18 54.8 20 female Republican
10 20 54.7 12 female Independent
# … with 22 more rows
The summarize()
function is used to compute summaries of data in a given column. Here we compute the mean income for all the employees.
%>%
city summarize(
M = mean(income)
)
# A tibble: 1 × 1
M
<dbl>
1 53.7
The output from summarize()
is a data frame with a single row and one or more columns, depending on how many summaries you computed. Here we computed a single summary so there is only one column. We also named the column M
within the summarize()
function.
Multiple summaries can be computed by providing more than one argument to the summarize()
function. The output is still a single row data frame, but now there will be multiple columns, one for each summary computation. Here we compute the mean income for all the employees and the standard deviation of the incomes.
%>%
city summarize(
M = mean(income),
SD = sd(income)
)
# A tibble: 1 × 2
M SD
<dbl> <dbl>
1 53.7 14.6
The group_by()
function groups the data by a specified variable. By itself, this function essentially does nothing. But it is powerful when the grouped output is piped to other functions, such as summarize()
. Here we use group_by(gender)
to compute the mean income and the standard deviation of the incomes for both males and females.
%>%
city group_by(gender) %>%
summarize(
M = mean(income),
SD = sd(income)
)
# A tibble: 2 × 3
gender M SD
<chr> <dbl> <dbl>
1 female 48.9 13.3
2 Not female 59.9 14.2
You can also use group_by()
with multiple attributes. Simply add additional column names in the group_by()
function to create more conditional groups. For example to compute the mean income and standard deviation for males and females conditioned on political party, we can use the following syntax.
%>%
city group_by(gender, party) %>%
summarize(
M = mean(income),
SD = sd(income)
)
# A tibble: 6 × 4
# Groups: gender [2]
gender party M SD
<chr> <chr> <dbl> <dbl>
1 female Democrat 51.7 17.5
2 female Independent 48.6 12.0
3 female Republican 47.6 19.5
4 Not female Democrat 53.0 14.9
5 Not female Independent 61.6 10.4
6 Not female Republican 70.8 9.41
This produces the summary measures for each of the combinations of the levels of gender and political affiliation in the data.