2 Questions to Ask About Data
One thing that is required in data visualization is data. Anytime you encounter data there are several questions that you should ask (Gould & Ryan, 2013). These questions not only help you become familiar with the data, but also aid in contextualizing what the data can tell you and what they can’t.
- What are the cases and how many are in the data?
- What are the attribute names and how many attributes are included in the data?
- What does each attribute represent?
- Which attributes are categorical? Which are quantitative?
- Who collected the data?
- For what purpose (why) were the data collected?
- How and where were the data collected?
- Are there any issues or ethical considerations to be aware of with the data?
The first four questions are about the actual data collected. They are asked to get a sense for the information in the dataset. This is important in thinking about the appropriate methods we might use to visualize the data. The last four questions help you contextualize the data. They help you in thinking about how you will communicate findings and trends from any potential visualization of the data.
2.1 NYT Bestselling Books Data: An Example
Consider the following data, which constitute a random sample of 25 New York Times bestselling hardcover fiction books (Pruett, 2021). These books were sampled from a list of all unique titles that appeared in the weekly New York Times bestselling hardcover fiction books list from 1931–2020. Take a minute to familiarize yourself with these data.
These data have a tabular structure; that is, they are organized into rows and columns. In well-organized tabular data, the rows represent cases (also called units) and the columns represent attributes (also referred to as variables). Attributes are the information or characteristics that are collected for each of the cases.
FYI
Tabular data are required in order to carry out most analyses with computational tools, although not all data are tabular. For example, many websites store data in non-tabular forms such as XML or JSON. In this course, all the data you work with will be tabular.
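To make the contrast concrete, here is a minimal Python sketch of turning nested JSON into rows and columns. The JSON text, its nesting, and the placeholder titles and authors are all invented for illustration; they are not taken from the NYT data or from any real website.

```python
import json

# Hypothetical JSON, invented for illustration: each record nests some of
# its information inside sub-objects rather than in flat columns.
json_text = """
[
  {"title": "EXAMPLE TITLE A",
   "author": {"name": "Example Author A"},
   "ranks": {"debut": 7, "best": 3}},
  {"title": "EXAMPLE TITLE B",
   "author": {"name": "Example Author B"},
   "ranks": {"debut": 10, "best": 2}}
]
"""

records = json.loads(json_text)

# Flatten each nested record into one row with the same four attributes,
# so that rows are cases and keys are attributes.
rows = [
    {
        "title": record["title"],
        "author": record["author"]["name"],
        "debut_rank": record["ranks"]["debut"],
        "best_rank": record["ranks"]["best"],
    }
    for record in records
]

for row in rows:
    print(row)
```

After flattening, every row has the same set of attributes, which is what makes the data tabular.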
Now let’s try to answer the eight questions about these data.
1. What are the cases and how many are in the data?
In these data, each case is a NYT best selling book. That is, each row represents a different NYT best selling hardcover fiction book. There are 25 total cases (books) in the data.
2. What are the attribute names and how many attributes are included in the data?
There are eight attributes in these data. They are:
- year
- title
- author
- female
- total_weeks
- first_week
- debut_rank
- best_rank
As you identify the attribute names, pay attention to details such as case (upper- vs. lowercase letters) and punctuation (e.g., underscores). You will ultimately be typing the names of the data’s attributes when you create your visualizations, and you need to type them exactly as they appear.
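Counting cases and attributes can also be done with a few lines of code. The sketch below uses Python's standard csv module on an in-memory stand-in for the file: the header row uses the eight attribute names listed above, but the three data rows are invented placeholders, not real books from the sample.

```python
import csv
import io

# Hypothetical stand-in for the real file: the header matches the eight
# attribute names above, but the three rows are invented placeholders.
csv_text = """year,title,author,female,total_weeks,first_week,debut_rank,best_rank
2001,EXAMPLE TITLE A,Example Author A,1,4,3/4/01,7,3
1988,EXAMPLE TITLE B,Example Author B,0,12,6/5/88,10,2
1975,EXAMPLE TITLE C,Example Author C,1,1,9/7/75,9,9
"""

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)                   # one dictionary per case (book)
attribute_names = reader.fieldnames   # column headers, in file order

n_cases = len(rows)
n_attributes = len(attribute_names)

print(n_cases, "cases;", n_attributes, "attributes")
print(attribute_names)
```

Run against the actual sample, the same approach should report 25 cases and 8 attributes. Printing the attribute names is also a quick way to see their exact spelling and case.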
3. What does each attribute represent?
Answering this question requires that you provide a human understandable description of each attribute. This also often includes any units of measurement and sometimes the format of the data.
The attributes in these data are:
- year: The year the book was published (formatted as YYYY).
- title: The title of the book (formatted in all uppercase).
- author: The author of the book (formatted in title case).
- female: Whether the author identifies as a female (0 = no, 1 = yes).
- total_weeks: The total number of weeks the book was on the NYT Best Sellers list.
- first_week: The first week the book appeared on the NYT Best Sellers list (formatted as m/d/yy).
- debut_rank: The book’s debut rank on the NYT Best Sellers list.
- best_rank: The book’s highest rank while it was on the NYT Best Sellers list.
As you do this, follow-up questions may arise. For example, what are the possible ranks in the NYT Best Sellers list? Is it 1–100 or 1–25? Answering such questions often requires doing some work to find additional information. Here is a nice blog post explaining the NYT bestsellers list. The larger dataset that our data were sampled from can be found here. This suggests that the ranking could be anywhere from 1–17.
To answer the next two questions, we need to understand how data scientists classify attributes.
2.2 Classifying Attributes
Our ultimate goal in creating a data visualization is typically to learn from or understand patterns in the data. For example, in our NYT Best Seller data, we may be interested in the proportion of authors that identify as female. Or, we may want to know how many weeks a book stays on the Best Sellers list. The type of analysis we can do, however, depends on the type of attributes we have.
We typically classify attributes as either categorical attributes or quantitative attributes. These classifications are based on the type of information (data) in the attribute. A categorical attribute has values that represent categorical (or qualitative) differences between the cases, whereas a quantitative attribute represents numerical (or quantitative) differences between cases. For example, in the NYT Best Seller data, title and author are categorical attributes, whereas year and the total number of weeks the book was on the NYT Best Sellers list are quantitative attributes.
Typically attributes that have numerical values are quantitative, but not always. In our data, consider the attribute that indicates whether the author identifies as a female. Although the values in the data are numeric, these numbers actually represent different categories: 0 = no (not female) and 1 = yes (female). Therefore, this attribute is actually a categorical attribute, not a quantitative attribute.
One check of whether an attribute is actually quantitative is whether numeric computations, such as finding an average of the attribute, can be carried out and the result makes conceptual sense. For example, we cannot compute the mean author value (it is thus a categorical attribute). If we compute the mean of the female attribute we get a result, but it does not indicate anything about the gender identity of a NYT best selling author. The mean does not make conceptual sense and thus we classify female as a categorical attribute.
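The first half of this check, whether a mean can be computed at all, can be sketched in Python. The helper name try_mean and the four placeholder values per attribute are invented for illustration; they are not the real sample.

```python
def try_mean(values):
    """Return the arithmetic mean, or None if the values cannot be averaged."""
    try:
        return sum(values) / len(values)
    except TypeError:
        # Raised when the values are not numbers (e.g., text).
        return None


# Invented placeholder values for three attributes (not the real sample).
author = ["Example Author A", "Example Author B", "Example Author C", "Example Author D"]
female = [1, 0, 1, 1]
total_weeks = [4, 12, 1, 3]

print(try_mean(author))       # None
print(try_mean(female))       # 0.75
print(try_mean(total_weeks))  # 5.0
```

Note that the code can only tell us whether the computation runs. It produces a number for female just as readily as for total_weeks, so the second half of the check, deciding whether that number makes conceptual sense, remains a judgment you have to make from the attribute's description.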
4. Which attributes are categorical? Which are quantitative?
Categorical attributes include:
- The title of the book,
- The author of the book, and
- Whether the author identifies as a female (0 = no, 1 = yes).
Quantitative attributes include:
- The year the book was published,
- The total number of weeks the book was on the NYT Best Sellers list,
- The first week the book appeared on the NYT Best Sellers list,
- The book’s debut rank on the NYT Best Sellers list, and
- The book’s highest rank while it was on the NYT Best Sellers list.
2.3 Context and Background
The answers to the last four questions give us information about the data’s context, which is key to drawing reasonable conclusions and interpreting any data visualization. If you are the one collecting the data, you should record all of this information so others have access to it. For the NYT best sellers data, here are the answers to these questions:
5. Who collected the data?
Jordan Pruett collected the data for the Post45 Data Collective. The Post45 Data Collective peer reviews and houses post-1945 literary data on an open-access website designed, hosted, and maintained by Emory University’s Center for Digital Scholarship.
According to LinkedIn, Jordan Pruett is a Data Scientist and Python developer working in the Greater Seattle Area at Lyssn (a company helping providers in health and human services and wellness improve engagement, satisfaction, and outcomes).
6. For what purpose (why) were the data collected?
As per Post45 Data Collective,
“These datasets provide valuable metadata for researchers of 20th century American literature working in fields such as cultural analytics, book and publishing history, and the sociology of literature. In cultural analytics, recent scholarship has used bestseller status as a rough proxy for popularity, enabling researchers to computationally model the textual boundaries between, for instance, popular and prizewinning fiction (Algee-Hewitt and McGurl 2015; Piper et al. 2016b; English 2016). Previous research of this kind has often relied on the Publishers Weekly annual bestseller list. Although Publishers Weekly also publishes a weekly list, it is not readily accessible to researchers. In contrast to the Publishers Weekly annual list, this dataset reports weekly bestsellers, and therefore captures a much broader subset of the historical literary marketplace.”
7. How and where were the data collected?
These data were scraped, using the open-source Python library pdfminer, from Hawes Publications, an online repository that publishes a PDF transcript of the list for every year going back to 1931.
8. Are there any issues or ethical considerations to be aware of with the data?
By understanding who collected the data, for what purpose, and how they were collected, we can start to evaluate things that might distort the information being communicated (e.g., bias in which data were collected). For example, if the data on bestselling books were made available by Random House (a major publishing company), we might be worried about whether books from other publishers were being fairly included. Additionally, we want to weigh any other ethical considerations. For example, we would want to consider not only things like anonymity or privacy, but also whether the data should be used for something that they weren’t originally intended for.
The Post45 Data Collective lays out several ethical considerations, including:
- Considering the limitations of drawing historical or cultural conclusions from bestseller data.
- Hardcover sales at bookstores are especially unrepresentative of the broader book market in the early years of the “paperback revolution” after WWII, when most popular novels were sold in paperback format at non-bookstore outlets like drugstores.
- The Times only expanded its coverage to include nationwide bestsellers in September of 1945. Before that, entries are based on sales in New York or other metropolitan areas.
2.4 Data Dictionaries: Documentation for the Data
The answers to these eight questions about the data are often compiled into a document called a data dictionary or a data codebook. In this class you will often be provided with a data dictionary along with the data. You should always read through the data dictionary. This provides important information about the data that helps you interpret and use the data more effectively. If you collect data as part of a study or project, you should create a dictionary for the data.
A data dictionary for the New York Times bestseller data is shown below.
Data Dictionary for nyt-best-sellers.csv
The data in nyt-best-sellers.csv are a sample from a larger dataset collected by Jordan Pruett for the Post45 Data Collective. The Post45 Data Collective peer reviews and houses post-1945 literary data on an open-access website designed, hosted, and maintained by Emory University’s Center for Digital Scholarship.
The larger dataset was obtained by using the open-source Python library pdfminer to scrape Hawes Publications, an online repository that publishes a PDF transcript of the list for every year going back to 1931. The 25 books in the data were randomly sampled from a list of all unique titles that appeared in the weekly New York Times bestselling hardcover fiction books list from 1931–2020. The attributes in these data include:
- year: The year the book was published (formatted as YYYY).
- title: The title of the book (formatted in all uppercase).
- author: The author of the book (formatted in title case).
- female: Whether the author identifies as a female (0 = no, 1 = yes).
- total_weeks: The total number of weeks the book was on the NYT Best Sellers list.
- first_week: The first week the book appeared on the NYT Best Sellers list (formatted as m/d/yy).
- debut_rank: The book’s debut rank on the NYT Best Sellers list.
- best_rank: The book’s highest rank while it was on the NYT Best Sellers list.
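A data dictionary can also be kept in a machine-readable form alongside the data. The sketch below is one possible layout, not a standard format: the descriptions come from the dictionary above and the types from the classification earlier in the chapter, while the structure, the helper name check_columns, and the deliberately mistyped column name are assumptions made for illustration.

```python
# One possible machine-readable layout for the data dictionary above.
data_dictionary = {
    "year": {"type": "quantitative", "description": "Year the book was published (YYYY)"},
    "title": {"type": "categorical", "description": "Title of the book (all uppercase)"},
    "author": {"type": "categorical", "description": "Author of the book (title case)"},
    "female": {"type": "categorical", "description": "Author identifies as female (0 = no, 1 = yes)"},
    "total_weeks": {"type": "quantitative", "description": "Total weeks on the NYT Best Sellers list"},
    "first_week": {"type": "quantitative", "description": "First week on the list (m/d/yy)"},
    "debut_rank": {"type": "quantitative", "description": "Debut rank on the list"},
    "best_rank": {"type": "quantitative", "description": "Highest rank while on the list"},
}


def check_columns(column_names, dictionary):
    """Return (undocumented, missing) attribute names, compared case-sensitively."""
    undocumented = [name for name in column_names if name not in dictionary]
    missing = [name for name in dictionary if name not in column_names]
    return undocumented, missing


# Hypothetical column names with one deliberate capitalization mistake.
columns = ["year", "title", "author", "female", "Total_Weeks",
           "first_week", "debut_rank", "best_rank"]

print(check_columns(columns, data_dictionary))
# (['Total_Weeks'], ['total_weeks'])
```

Because the comparison is case-sensitive, the mistyped name shows up as both an undocumented column and a missing attribute, which is the kind of mismatch that would otherwise surface only when a visualization fails.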
Exercises: Your Turn
Inside Airbnb is a mission-driven project that provides data and advocacy about Airbnb’s impact on residential communities. The data below constitute a random sample of Airbnb listings from the Twin Cities scraped by Inside Airbnb. Use these data along with information from the data dictionary to answer questions 1–4 below.
- What are the cases and how many are in the data?
- What are the attribute names and how many attributes are included in the data?
- What does each attribute represent?
- Which attributes are categorical? Which are quantitative?
Inside Airbnb includes a lot of information about the data in “Data Resources” on their website. In particular, the “Data Dictionary” and “Data Assumptions” pages have a great deal of useful information. Use this information to answer questions 5–8 below.
- Who/what organization collected the data?
- For what purpose (why) were the data collected?
- How and where were the data collected?
- Are there any issues or ethical considerations to be aware of with the data?