5  Summarizing and Visualizing Categorical Attributes

This chapter will focus on the visualization and numerical summarization of categorical attributes.


5.1 Hypothetical Example: Pet Ownership

Imagine you have surveyed 10 pet owners about the type of pet they own.1 The data you collected is shown in Figure 5.1.

Figure 5.1: Data collected from 10 pet owners about the type of pet the own.

Our goal in exploratory analysis is to describe the data. One way of describing the data is to list all of the values. For example here we could say the sample included a turtle, a fish, a cat, a dog, another dog, another fish, another cat, another cat, another cat, and another dog. While this is an accurate description, it isn’t very generalizable. (Imagine trying to describe the data from 1000 pet owners or 10,000 pet owners!)


5.1.1 Numerically Summarizing Categorical Attributes

A more natural way to describe these data is to summarize them by providing counts of each pet type. For example, describing our sample data using counts:

  • 1 of the pet owners sampled owned a turtle,
  • 2 of the pet pet owners sampled owned a fish,
  • 3 of the pet pet owners sampled owned a dog, and
  • 4 of the pet pet owners sampled owned a cat.

Summarizing each type of pet owned by reporting counts of them is a much more natural way of describing the data. (This is also useful when the sample size is much larger.)

Another summary that could be used to describe this sample is to give the proportion of each type of pet owned. To compute the proportion, we take the count of each type of pet owned, and divide it by the total sample size.

\[ \mathrm{Proportion} = \frac{\mathrm{Count~of~Pet~Type}}{\mathrm{Total~Sample~Size}} \]

For example, to compute the proportion of pet owners in our sample that owned a dog, we use:

\[ \mathrm{Proportion~of~Dogs} = \frac{3}{10} = 0.30 \]

Proportions will always be between 0 and 1. If you add all of the proportions of each category together you will get 1, so long as values can only belong to one category.

Describing our sample data using proportions:

  • 0.10 of the pet owners sampled owned a turtle,
  • 0.20 of the pet pet owners sampled owned a fish,
  • 0.30 of the pet pet owners sampled owned a dog, and
  • 0.40 of the pet pet owners sampled owned a cat.

The count and proportion values are often reported in a table, especially if there are more than a couple values in the categorical attribute. Table 5.1 is an example table indicating the counts and proportions of values in our hypothetical pet example.

Table 5.1: Counts and proportions of pet owners who own each type of pet.
Pet Count Proportion
3 0.30
4 0.40
2 0.20
1 0.10


5.1.2 Visualizing Categorical Attributes

To visualize categorical attributes we typically use a bar chart. Figure 5.2 shows a bar chart of the pet data.

Figure 5.2: Bar chart indicating the counts of each type of pet owned.

A bar chart (also known as a bar graph) shows a bar for each category of the categorical attribute. In our example, we have four bars, one for each pet type. The height of the bar indicates the count of each category. For example, the bar for cats has a height of four on the y-axis.

Sometimes the axes in the bar chart are transposed; categories are placed on the y-axis and counts on the x axis. Also, you might see a bar chart indicating proportions rather than counts. Figure 5.3 shows a transposed bar chart indicating the proportion of each pet type.

Figure 5.3: Bar chart indicating the counts of each type of pet owned. In this plot the categories are placed on the y-axis and the scale on the the x axis indicates proportions rather than counts.

When proportions are used in a bar chart, it is coventional to extend that axis from 0 to 1 (the range of potential proportions).


5.1.2.1 Bar Chart Variations

There are several variations on the bar chart that you may see in practice. For example, the segmented bar chart is a variation of the bar chart. This variation of the plot, which always uses proportions rather than counts, has a single bar that is split into segments—one for each category. A segmented bar chart summarizing the pet data is shown in Figure 5.4.

Figure 5.4: Segmented bar chart indicating the proportion of each type of pet owned.

Another variation of the bar chart is the donut chart. A donut chart is simply a segmented bar chart that is presented in a circular layout. Figure 5.5 presents a donut chart summarizing the pet data. Because there is no axis to indicate the proportion of each category in a donut chart, it is conventional to indicate the percentages of each category in the plot. Here percentages are used rather than proportions.

Figure 5.5: Donut chart indicating the percentage of each type of pet owned. A donut chart is simply a segmented bar chart presented in a circular layout.


5.1.2.2 Pie Charts

One last plot used to visualize summaries of categorical dat is the pie chart. Figure 5.6 shows a pie chart summarizing the pet data. Unlike any of the bar charts that we have looked at, a pie chart uses the proportional area of a circle to indicate the proportion or percentage of each pet type rather than the length of a bar.

Figure 5.6: Pie chart indicating the percentage of each type of pet owned. A pie chart uses the proportional area of a circle to indicate the proportion or percentage of each pet type.

In a pie chart, each section of the pie is determined by proportionally dividing the 360° of the circle based on the data, and then making each section have the computed angle. For example the proportion of pet owners who have a cat is 0.4, and 0.4 of 360° is 144°—so the cats section has an angle of 144°. While these computations and the drawing of the pie chart would be done by the computer, it has implications for interpretations. Namely, research has suggested that humans may not be as adept at making accurate comparisons involving angles and the areas of sections based on those angles. Because of this, the recommendation from the data visualization community is to use bar charts rather than pie charts when displaying summaries of categorical data.


5.2 Using R to Numerically Summarize Categorical Attributes

To illustrate how we can summarize and visualize categorical attributes using R, we will use the college-scorecard.csv data. As a reminder, we will start by loading three libraries: {tidyverse}, {ggformula}, and {mosaicCore}. Then we will import the college scorecard data using the read_csv() function and assign the data into an object called colleges. Finally, we view the data to make sure it read in properly.

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.3.2
Warning: package 'tidyr' was built under R version 4.3.2
library(ggformula)
Warning: package 'ggridges' was built under R version 4.3.2
library(mosaicCore)

# Import data
colleges <- read_csv("https://raw.githubusercontent.com/zief0002/epsy-5261/main/data/college-scorecard.csv")

# View data
colleges

Recall that in these data each case in these data is an institution of higher education and there are 234 institutions in our sample. This dataset has several categorical attributes, including: state, region, type of institution, and control.

The first attribute we will summarize and visualize is the control variable. Looking at the data codebook we find that this attribute can take on three different values: Public, Private nonprofit, and Private for-profit. Our first task is to get institution counts for each category of control. To do this, we will use the df_stats() function. The general syntax to obtain these counts is shown below.

# General syntax to counts the categories in a categorical attribute
df_stats( ~ attribute_name, data = data_name, counts)

We first need to identify the name of the categorical attribute. To tell df_stats() that this is an attribute, we place a tilde (~) in front of the attribute name. Then we use the data= argument to identify the data object that includes our categorical attribute. Finally, we use counts to indicate that we want to compute the category counts.

It is a good idea to learn how to read R syntax. The tilde operator indicates a special kind of expression called a formula, and can be read as “model”. So the general syntax above is read as, “model the attribute_name found in the data_name data by counting the categories in the attribute”.

Putting this into practice to count the categories in the control attribute which is in our colleges data:

# Syntax to count the categories in the control attribute
df_stats(~control, data = colleges, counts)

If we were to read this syntax, “model the control attribute found in the colleges data by counting the categories in the attribute”.

Based on the counts, we find that most of the institutions of higher learning in our sample are private nonprofit institutions (n = 146). There are also several public institutions of higher learning in our sample (n = 71). Lastly, there are also a few private for-profit institutions of higher learning in our sample (n = 13).

It is common to use n or N to denote sample size. Some textbooks and authors will use N to indicate the overall sample size (e.g., in the college scorecard data, N = 234) and n to indicate the sample size of subgroups within the sample (e.g., n = 71 for public institutions). Other authors might use n to define the overall sample size (e.g., n = 234) and then use subscripts on n to denote the sample size of different groups (e.g., \(n_{\mathrm{Public}}=71\)). There is not a single unified agreed upon way to denote these things.

We also might want to compute the proportions for each category of control. To do this, we can again use the function df_stats(), but instead of providing counts we will provide props.

# Syntax to compute the proportion for each category in the control attribute
df_stats(~control, data = colleges, props)

These proportions (which can also be turned into percentages2) tell a similar story to what the counts did. Most of the institutions in our sample are private nonprofit (63.5%) and public (30.9%) institutions. There is a smaller percentage of institutions that are private for-profit (5.7%).3 As we did with the numerical summaries of the pet data, we can include these values in a table.

Table 5.2: Counts and proportions of institutions of higher learning by control.
Control Count Proportion
Public 71 0.309
Private Nonprofit 146 0.635
Private For-Profit 13 0.057


Your Turn

Write the syntax to compute the counts of the region attribute.

Write the syntax to compute the proportions of the region attribute.

5.3 Creating a Bar Chart to Summarize Counts using R

To create a bar chart, we will use the gf_counts() function. This general syntax for gf_counts() is

# General syntax to create a bar chart
gf_counts(~ attribute_name, data = data_name)

In this function we indicate the name of the attribute we want to create a bar chart for with a tilde (~) precedeing the attribute name. We also give the name of the data object in the data= argument. For example, the syntax to create a bar chart summarizing the counts of the control attribute is:

# Syntax to create a bar chart for the control attribute
gf_counts(~ control, data = colleges)
Figure 5.7: Bar chart summarizing the counts of the control attribute.

Reading the syntax: Create a bar chart by modeling the counts of the control attribute in the colleges data object.

We can make this plot nicer by changing the axes labels. For example, we might change the y-axis label to “Count” and the x-axis label to “Type of Institution”. To do this we include the xlab= and ylab= arguments in the gf_counts() function. The labels we want depicted are given as text inside of quotation marks. Remember that each argument needs to be separated by a comma!

As you include additional arguments in the function, it can be useful to include each argument on different lines. This will help you troubleshoot syntax that doesn’t work.

# Syntax to create a bar chart for the control attribute
gf_counts( 
  ~ control, data = colleges,
  xlab = "Type of Institution",
  ylab = "Count"
  )
Figure 5.8: Bar chart summarizing the counts of institutions of higher learning by type.

Your Turn

Write the syntax to create a bar chart that summarizes the counts of the region attribute.

Add better labels to the x- and y-axis of the bar chart you just created.


5.4 Creating a Bar Chart to Summarize Proportions using R

To create a bar chart that summarizes the proportion of each category (rather than counts) we can use the gf_props() function. The syntax for this function is identical to that of gf_counts(). The syntax to create a bar chart summarizing the proportion of the control attribute is:

# Syntax to create a bar chart for the control attribute
gf_props(
  ~control, data = colleges,
  xlab = "Type of Institution",
  ylab = "Count"
  )
Figure 5.9: Bar chart summarizing the proportions of institutions of higher learning by type.

You can also create a bar chart that summarizes the percentage of each category using the gf_percents() function.

# Syntax to create a bar chart for the control attribute
gf_percents(
  ~control, data = colleges,
  xlab = "Type of Institution",
  ylab = "Percent"
  )
Figure 5.10: Bar chart summarizing the percentage of institutions of higher learning by type.

Your Turn

Write the syntax to create a bar chart that summarizes the proportions of the region attribute. Also change the axis labels on both the x- and y-axes.


5.5 Creating Horizontal Bar Charts Using R

To create horizontal bar charts we can use add on (literally) the function coord_flip() , to gf_counts(), gf_props() or gf_percents(). For example, to create a horizontal bar chart summarizing the counts of each type of institution we can use the following syntax:

# Syntax to create a hirizontal bar chart for the control attribute
gf_counts(
  ~control, data = colleges,
  xlab = "Count",
  ylab = "Type of Institution"
  ) +
  coord_flip()
Figure 5.11: Horizontal bar chart summarizing the number of institutions of higher learning by type.

Your Turn

Write the syntax to create a horizontal bar chart that summarizes the percent of the region attribute. Also change the axis labels on both the x- and y-axes.


5.6 Summary

There are several R functions that are useful for summarizing categorical attributes. Table 6.1 shows the functions (and their descriptions) you will use to summarize and visualize categorical attributes. Note that they all have very parallel syntax.

Table 5.3: The functions and their descriptions to summarize and visualize categorical attributes.
Function Description
Summarize

df_stats(~attribute, data = data_object, counts)

Compute counts of a categorical attribute

df_stats(~attribute, data = data_object, props)

Compute proportions of a categorical attribute

df_stats(~attribute, data = data_object, percs)

Compute percentages of a categorical attribute
Visualize

gf_counts(~attribute, data = data_object)

Create bar chart of counts

gf_props(~attribute, data = data_object)

Create bar chart of proportions

gf_percents(~attribute, data = data_object)

Create bar chart of percentages

gf_counts(~attribute, data = data_object) + coord_flip()

Create horizontal bar chart of counts

gf_props(~attribute, data = data_object) + coord_flip()

Create horizontal bar chart of proportions

gf_percent(~attribute, data = data_object) + coord_flip()

Create horizontal bar chart of percentages

Lastly, there are several optional arguments you can include in the visualization (gf_) functions to improve the aesthetic quality of your plot. Some of these are listed in Table 5.4.

Table 5.4: Optional argumnents that can be included in the visualization (gf_) functions.
Argument
Labels

`xlab =

`ylab =

`title =

`subtitle =

`caption =

Color

`fill =

`color =

Remember to separate each argument with a comma if you include multiple arguments in the function.



  1. To keep it simple, assume each pet owner only has a single pet.↩︎

  2. You can also use df_stats() to compute percentages by providing percs.↩︎

  3. Note that these percentages do not quite add up to 100%. This is because of the rounding that we did.↩︎