Data Wrangling with dplyr

In this chapter, you will be learn about how to use functionality from the dplyr package to wrangle your data. Data wrangling is the catchall phrase that includes the processes of cleaning, structuring, and summarizing your data. It is a skill that every educational scientist needs to have in their computational toolkit!

We will use the riverview.csv data to illustrate several data wrangling ideas. The data contain five attributes collected from a random sample of \(n=32\) employees working for the city of Riverview, a hypothetical midwestern city (see the data codebook). To begin, we will load several libraries and import the data into an object called city.

# Load libraries
library(dplyr)
library(readr)

# Read in data
city = read_csv(file = "https://raw.githubusercontent.com/zief0002/modeling/master/data/riverview.csv")
head(city)

# A tibble: 6 × 5
  education income seniority gender     party      
      <dbl>  <dbl>     <dbl> <chr>      <chr>      
1         8   26.4         9 female     Independent
2         8   37.4         7 Not female Democrat   
3        10   34.2        16 female     Independent
4        10   25.5         1 female     Republican 
5        10   47.0        14 Not female Democrat   
6        12   46.5        11 female     Democrat

Piping: The Key to Using dplyr

Recall that functions work by taking arguments as inputs and then producing an output. For example, the summary() function takes the city data frame as its input.

summary(city)

   education      income        seniority        gender         
 Min.   : 8   Min.   :25.48   Min.   : 1.00   Length:32         
 1st Qu.:12   1st Qu.:44.51   1st Qu.: 9.75   Class :character  
 Median :16   Median :55.83   Median :15.00   Mode  :character  
 Mean   :16   Mean   :53.74   Mean   :14.81                     
 3rd Qu.:20   3rd Qu.:62.72   3rd Qu.:20.25                     
 Max.   :24   Max.   :82.73   Max.   :27.00                     
    party          
 Length:32         
 Class :character  
 Mode  :character

We could get the same result by using the piping operator (%>%). This operator takes a DATA FRAME (given immediately before the operator) and uses it as the FIRST argument in the function that comes immediately after the operator.

# The pipe operator makes city the first argument of the summary function
city %>% summary()

   education      income        seniority        gender         
 Min.   : 8   Min.   :25.48   Min.   : 1.00   Length:32         
 1st Qu.:12   1st Qu.:44.51   1st Qu.: 9.75   Class :character  
 Median :16   Median :55.83   Median :15.00   Mode  :character  
 Mean   :16   Mean   :53.74   Mean   :14.81                     
 3rd Qu.:20   3rd Qu.:62.72   3rd Qu.:20.25                     
 Max.   :24   Max.   :82.73   Max.   :27.00                     
    party          
 Length:32         
 Class :character  
 Mode  :character

Note since the summary() function did NOT include any additional arguments, we do not include anything between the parentheses after we pipe. Here is another example that illustrate the use of the pipe operator.

# Count number of rows in city data frame
nrow(city)

[1] 32

# Can be written using the pipe operator as...
city %>% nrow()

[1] 32

The filter() function from the dplyr package is used to select certain rows from a data frame. For example, the syntax below selects all rows from the city data frame that have a gender value of female. To do this we provide a logical statement as the second argument to the filter() function.

# Select the rows from city that have a gender value of "female" 
filter(city, gender == "female")

# A tibble: 18 × 5
   education income seniority gender party      
       <dbl>  <dbl>     <dbl> <chr>  <chr>      
 1         8   26.4         9 female Independent
 2        10   34.2        16 female Independent
 3        10   25.5         1 female Republican 
 4        12   46.5        11 female Democrat   
 5        12   52.5        16 female Independent
 6        14   32.6         5 female Independent
 7        15   37.3         8 female Democrat   
 8        16   38.6        11 female Independent
 9        16   55.9        22 female Independent
10        16   59.5        20 female Independent
11        17   60.1        10 female Independent
12        18   54.8        20 female Republican 
13        18   62.5        16 female Republican 
14        19   56.0        21 female Independent
15        20   56.3        22 female Independent
16        20   54.7        12 female Independent
17        21   71.2        26 female Democrat   
18        22   56.3         8 female Independent

# Can be written using the pipe operator as...
city %>% filter(gender == "female")

# A tibble: 18 × 5
   education income seniority gender party      
       <dbl>  <dbl>     <dbl> <chr>  <chr>      
 1         8   26.4         9 female Independent
 2        10   34.2        16 female Independent
 3        10   25.5         1 female Republican 
 4        12   46.5        11 female Democrat   
 5        12   52.5        16 female Independent
 6        14   32.6         5 female Independent
 7        15   37.3         8 female Democrat   
 8        16   38.6        11 female Independent
 9        16   55.9        22 female Independent
10        16   59.5        20 female Independent
11        17   60.1        10 female Independent
12        18   54.8        20 female Republican 
13        18   62.5        16 female Republican 
14        19   56.0        21 female Independent
15        20   56.3        22 female Independent
16        20   54.7        12 female Independent
17        21   71.2        26 female Democrat   
18        22   56.3         8 female Independent

Here, since the filter() function included a second argument, we include that argument in the function that the data frame is piped into. What is piped into the function, in this case city, will be automatically inputted into the FIRST argument.

Common dplyr Functions for Data Wrangling

Here are some common operations that researchers use to prepare data for analysis (i.e., data preparation, data wrangling, data cleaning) and the corresponding dplyr functions.

Table 1: Common data wrangling activities and the corresponding **dplyr** function.
Data wrangling activity	dplyr function
Select a subset of rows from a data frame.	`filter()`
Select a subset of columns from a data frame.	`select()`
Add new columns to a data frame.	`mutate()`
Sort and re-order data in a data frame.	`arrange()`
Compute summaries of columns in a data frame.	`summarize()`
Group the data to carry out computations for each group.	`group_by()`

Select a Subset of Rows: Filtering

To select a subset of rows, we will pipe the data frame we want to select rows from into the filter() function. The argument(s) for this function are logical expressions that will be used to select the rows.

# Select the female employees
city %>% 
  filter(gender == "female")

# A tibble: 18 × 5
   education income seniority gender party      
       <dbl>  <dbl>     <dbl> <chr>  <chr>      
 1         8   26.4         9 female Independent
 2        10   34.2        16 female Independent
 3        10   25.5         1 female Republican 
 4        12   46.5        11 female Democrat   
 5        12   52.5        16 female Independent
 6        14   32.6         5 female Independent
 7        15   37.3         8 female Democrat   
 8        16   38.6        11 female Independent
 9        16   55.9        22 female Independent
10        16   59.5        20 female Independent
11        17   60.1        10 female Independent
12        18   54.8        20 female Republican 
13        18   62.5        16 female Republican 
14        19   56.0        21 female Independent
15        20   56.3        22 female Independent
16        20   54.7        12 female Independent
17        21   71.2        26 female Democrat   
18        22   56.3         8 female Independent

Here we are selecting only the rows where the gender variable is equal to (==) the character string “female.” Recall that a single equals sign (=) is the assignment operator and that to say “is equal to,” we need to use two equals signs (==).

It is a good coding practice to use multiple lines when you are piping rather than putting all the syntax on a single line. When you do this, the pipe operator (%>%) needs to come at the end of the line. For example, in the code above, the pip operator is at the end of the first line of syntax rather than at the beginning of the second line of syntax. Include a line break after every pipe operator you use.

Note that the output from this computation (the rows of female employees) is only printed to the screen. If you want to keep the filtered data or operate on it further, you need to assign the output into an object.

# Filter the female employees
female_employees = city %>% 
  filter(gender == "female")

# Count the number of rows (females)
nrow(female_employees)

[1] 18

We could have found the same result exclusively using piping; without the interim assignment.

city %>% 
  filter(gender == "female") %>%
  nrow()

[1] 18

The first pipe operator uses the city data frame in the filter() function to select the female employees. This output (only the female employees) is then used in the nrow() function to count the number of rows. At is akin to a constant pipeline of chaining functions together; the output of a computation is used as the input into the next computation in the pipeline.

Here we use filter() to select the employees that have less than a high school level of education and then summarize all of the columns using the summary() function.

city %>% 
  filter(education < 12) %>%
  summary()

   education        income        seniority       gender         
 Min.   : 8.0   Min.   :25.48   Min.   : 1.0   Length:5          
 1st Qu.: 8.0   1st Qu.:26.43   1st Qu.: 7.0   Class :character  
 Median :10.0   Median :34.18   Median : 9.0   Mode  :character  
 Mean   : 9.2   Mean   :34.11   Mean   : 9.4                     
 3rd Qu.:10.0   3rd Qu.:37.45   3rd Qu.:14.0                     
 Max.   :10.0   Max.   :47.03   Max.   :16.0                     
    party          
 Length:5          
 Class :character  
 Mode  :character

Filtering on Multiple Attributes

You can filter on multiple attributes by including more than one logical statement in the filter() function. For example, the syntax below counts the number of female employees with less than a high school level of education.

city %>% 
  filter(gender == "female", education < 12) %>%
  nrow()

[1] 3

Here, when we included multiple logical expressions in the filter() function, separated by a comma, they were linked using the AND (&) operator. This means that both expressions have to evaluate as TRUE to be included. We could also have explicitly used the & operator to link the two statements.

city %>% 
  filter(gender == "female", education < 12)

# Is equivalent to...
city %>% 
  filter(gender == "female" & education < 12)

We can also filter() using the OR (|) operator. This means that if EITHER expression evaluates as TRUE it is included.

city %>% 
  filter(gender == "female" | education < 12)

# A tibble: 20 × 5
   education income seniority gender     party      
       <dbl>  <dbl>     <dbl> <chr>      <chr>      
 1         8   26.4         9 female     Independent
 2         8   37.4         7 Not female Democrat   
 3        10   34.2        16 female     Independent
 4        10   25.5         1 female     Republican 
 5        10   47.0        14 Not female Democrat   
 6        12   46.5        11 female     Democrat   
 7        12   52.5        16 female     Independent
 8        14   32.6         5 female     Independent
 9        15   37.3         8 female     Democrat   
10        16   38.6        11 female     Independent
11        16   55.9        22 female     Independent
12        16   59.5        20 female     Independent
13        17   60.1        10 female     Independent
14        18   54.8        20 female     Republican 
15        18   62.5        16 female     Republican 
16        19   56.0        21 female     Independent
17        20   56.3        22 female     Independent
18        20   54.7        12 female     Independent
19        21   71.2        26 female     Democrat   
20        22   56.3         8 female     Independent

This syntax would select any employee that is either female OR has an education less than 12 years.

Selecting a Subset and Renaming Columns

To select a subset of columns, we will use the select() function. The argument(s) for this function are the column names of the data frame that you want to select. For example, to select the education, income, and gender columns from the city data frame we would use the following syntax:

city %>% 
  select(education, income, gender)

# A tibble: 32 × 3
   education income gender    
       <dbl>  <dbl> <chr>     
 1         8   26.4 female    
 2         8   37.4 Not female
 3        10   34.2 female    
 4        10   25.5 female    
 5        10   47.0 Not female
 6        12   46.5 female    
 7        12   52.5 female    
 8        12   37.7 Not female
 9        12   50.3 Not female
10        14   32.6 female    
# … with 22 more rows

There are a number of helper functions you can use within the select() function. For example, starts_with(), ends_with(), and contains(). These let you quickly match larger blocks of columns that meet some criterion. The syntax below selects all the columns that have a column name that ends in ‘e.’

city %>% 
  select(ends_with("e"))

# A tibble: 32 × 1
   income
    <dbl>
 1   26.4
 2   37.4
 3   34.2
 4   25.5
 5   47.0
 6   46.5
 7   52.5
 8   37.7
 9   50.3
10   32.6
# … with 22 more rows

Renaming Columns

You can rename a column by using the rename() function. Here we select the education, income, and gender columns from the city data frame and then rename the education column to educ. Note that this works similar to assignment in that the new column name is to the left of the equal sign.

city %>% 
  select(education, income, gender) %>%
  rename(educ = education)

# A tibble: 32 × 3
    educ income gender    
   <dbl>  <dbl> <chr>     
 1     8   26.4 female    
 2     8   37.4 Not female
 3    10   34.2 female    
 4    10   25.5 female    
 5    10   47.0 Not female
 6    12   46.5 female    
 7    12   52.5 female    
 8    12   37.7 Not female
 9    12   50.3 Not female
10    14   32.6 female    
# … with 22 more rows

Create New Columns: Mutating

To create new columns, we will use the mutate() function. Here we create a new column called income2 based on multiplying the original income column by 1000.

city %>% 
  mutate(
    income2 = income * 1000
    )

# A tibble: 32 × 6
   education income seniority gender     party       income2
       <dbl>  <dbl>     <dbl> <chr>      <chr>         <dbl>
 1         8   26.4         9 female     Independent   26430
 2         8   37.4         7 Not female Democrat      37449
 3        10   34.2        16 female     Independent   34182
 4        10   25.5         1 female     Republican    25479
 5        10   47.0        14 Not female Democrat      47034
 6        12   46.5        11 female     Democrat      46488
 7        12   52.5        16 female     Independent   52480
 8        12   37.7        14 Not female Democrat      37656
 9        12   50.3        24 Not female Democrat      50265
10        14   32.6         5 female     Independent   32631
# … with 22 more rows

Depending on your print options, the results in income2 may be in scientific notation. For example 3.74e+04 is equivalent to \(3.74 \times 10^4=37400\).

You can “turn off” scientific notation by running the following syntax in your R session:

options(scipen = 99)

You will need to run this syntax every R session.

You can create multiple new columns within the same mutate() function. Simply include each new column as an argument. Below we again create income2, but we also additionally create cent_educ which computes the difference between each employee’s education level and the mean education level.

city %>% 
  mutate(
    income2 = income * 1000,
    cent_educ = education - mean(education)
    )

# A tibble: 32 × 7
   education income seniority gender     party       income2 cent_educ
       <dbl>  <dbl>     <dbl> <chr>      <chr>         <dbl>     <dbl>
 1         8   26.4         9 female     Independent   26430        -8
 2         8   37.4         7 Not female Democrat      37449        -8
 3        10   34.2        16 female     Independent   34182        -6
 4        10   25.5         1 female     Republican    25479        -6
 5        10   47.0        14 Not female Democrat      47034        -6
 6        12   46.5        11 female     Democrat      46488        -4
 7        12   52.5        16 female     Independent   52480        -4
 8        12   37.7        14 Not female Democrat      37656        -4
 9        12   50.3        24 Not female Democrat      50265        -4
10        14   32.6         5 female     Independent   32631        -2
# … with 22 more rows

Sorting the Data: Arranging

The arrange() function sorts the data based on the values within one or more specified columns. The data is ordered based on the column name provided in the argument(s). The syntax below sorts the rows in the city data frame from smallest to largest income.

city %>% 
  arrange(income)

# A tibble: 32 × 5
   education income seniority gender     party      
       <dbl>  <dbl>     <dbl> <chr>      <chr>      
 1        10   25.5         1 female     Republican 
 2         8   26.4         9 female     Independent
 3        14   32.6         5 female     Independent
 4        10   34.2        16 female     Independent
 5        15   37.3         8 female     Democrat   
 6         8   37.4         7 Not female Democrat   
 7        12   37.7        14 Not female Democrat   
 8        16   38.6        11 female     Independent
 9        12   46.5        11 female     Democrat   
10        10   47.0        14 Not female Democrat   
# … with 22 more rows

Providing the arrange() function multiple arguments sort initially by the column name given in first argument, and then by the columns given in subsequent arguments. Here the data are sorted first by gender (alphabetically since gender is a character string) and then by income (lowest to highest).

city %>% 
  arrange(gender, income)

# A tibble: 32 × 5
   education income seniority gender party      
       <dbl>  <dbl>     <dbl> <chr>  <chr>      
 1        10   25.5         1 female Republican 
 2         8   26.4         9 female Independent
 3        14   32.6         5 female Independent
 4        10   34.2        16 female Independent
 5        15   37.3         8 female Democrat   
 6        16   38.6        11 female Independent
 7        12   46.5        11 female Democrat   
 8        12   52.5        16 female Independent
 9        20   54.7        12 female Independent
10        18   54.8        20 female Republican 
# … with 22 more rows

Use the desc() function on a column name to sort the data in descending order. Here the data are sorted first by gender (alphabetically since gender is a character string) and then by income. Here the data are sorted first by gender (alphabetically) and then by income (highest to lowest).

city %>% 
  arrange(gender, desc(income))

# A tibble: 32 × 5
   education income seniority gender party      
       <dbl>  <dbl>     <dbl> <chr>  <chr>      
 1        21   71.2        26 female Democrat   
 2        18   62.5        16 female Republican 
 3        17   60.1        10 female Independent
 4        16   59.5        20 female Independent
 5        20   56.3        22 female Independent
 6        22   56.3         8 female Independent
 7        19   56.0        21 female Independent
 8        16   55.9        22 female Independent
 9        18   54.8        20 female Republican 
10        20   54.7        12 female Independent
# … with 22 more rows

Computing Summaries of Data in a Column

The summarize() function is used to compute summaries of data in a given column. Here we compute the mean income for all the employees.

city %>% 
  summarize(
    M = mean(income)
    )

# A tibble: 1 × 1
      M
  <dbl>
1  53.7

The output from summarize() is a data frame with a single row and one or more columns, depending on how many summaries you computed. Here we computed a single summary so there is only one column. We also named the column M within the summarize() function.

Multiple summaries can be computed by providing more than one argument to the summarize() function. The output is still a single row data frame, but now there will be multiple columns, one for each summary computation. Here we compute the mean income for all the employees and the standard deviation of the incomes.

city %>% 
  summarize(
    M  = mean(income),
    SD = sd(income)
    )

# A tibble: 1 × 2
      M    SD
  <dbl> <dbl>
1  53.7  14.6

Computations on Groups

The group_by() function groups the data by a specified variable. By itself, this function essentially does nothing. But it is powerful when the grouped output is piped to other functions, such as summarize(). Here we use group_by(gender) to compute the mean income and the standard deviation of the incomes for both males and females.

city %>% 
  group_by(gender) %>%
  summarize(
    M  = mean(income),
    SD = sd(income)
    )

# A tibble: 2 × 3
  gender         M    SD
  <chr>      <dbl> <dbl>
1 female      48.9  13.3
2 Not female  59.9  14.2

You can also use group_by() with multiple attributes. Simply add additional column names in the group_by() function to create more conditional groups. For example to compute the mean income and standard deviation for males and females conditioned on political party, we can use the following syntax.

city %>% 
  group_by(gender, party) %>%
  summarize(
    M  = mean(income),
    SD = sd(income)
    )

# A tibble: 6 × 4
# Groups:   gender [2]
  gender     party           M    SD
  <chr>      <chr>       <dbl> <dbl>
1 female     Democrat     51.7 17.5 
2 female     Independent  48.6 12.0 
3 female     Republican   47.6 19.5 
4 Not female Democrat     53.0 14.9 
5 Not female Independent  61.6 10.4 
6 Not female Republican   70.8  9.41

This produces the summary measures for each of the combinations of the levels of gender and political affiliation in the data.