3 “Tidy” data

The concept of “tidy” data is central to the tidyverse packages. Hadley Wickham describes the concept and its requirements in full in an in-depth paper published in the Journal of Statistical Software, but for now we’re just going to look at the basics.

In order for a dataset to be tidy, it needs to abide by three rules:

Every column is a variable
Every row is an observation
Every cell is a single value

If a dataset doesn’t abide by all three of these rules, then it’s a “messy” dataset.

Let’s compare the same dataset in a ‘messy’ and a ‘tidy’ format. Imagine we have the number of cases of emphysema in different countries, recorded over four months. The first output below is the messy version and the second is the tidy version:

## # A tibble: 3 × 5
##   Country Month1 Month2 Month3 Month4
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>
## 1 UK          10     15     20     25
## 2 US           1      2      3      4
## 3 France      55    110    220    440

## # A tibble: 12 × 3
##    Country Month Cases
##    <chr>   <int> <dbl>
##  1 UK          1    10
##  2 UK          2    15
##  3 UK          3    20
##  4 UK          4    25
##  5 US          1     1
##  6 US          2     2
##  7 US          3     3
##  8 US          4     4
##  9 France      1    55
## 10 France      2   110
## 11 France      3   220
## 12 France      4   440

In the first representation, we’ve got a separate column for each month, so the columns are not distinct variables. Instead, a single variable (the number of cases) is spread across four columns, and the month itself is hidden in the column names rather than stored as a value. This is therefore a ‘messy’ way of representing this dataset.

In the second representation, Month is a column in its own right, and all of the case counts sit in a single Cases column. Each row is now one observation (one country in one month), so in this format the dataset is ‘tidy’.
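As a brief preview of the conversion covered later, the sketch below rebuilds the messy table and reshapes it into the tidy one. It assumes the tibble and tidyr packages are installed (tidyr 1.1.0 or later for `names_transform`); the object names `messy` and `tidy` are purely illustrative and not part of the original example.

```r
library(tibble)
library(tidyr)

# Recreate the 'messy' table: one column per month
messy <- tribble(
  ~Country, ~Month1, ~Month2, ~Month3, ~Month4,
  "UK",          10,      15,      20,      25,
  "US",           1,       2,       3,       4,
  "France",      55,     110,     220,     440
)

# Gather the four month columns into a Month/Cases pair of columns
tidy <- pivot_longer(
  messy,
  cols = starts_with("Month"),                 # the columns to reshape
  names_to = "Month",                          # old column names go here
  names_prefix = "Month",                      # strip the "Month" prefix, leaving "1".."4"
  names_transform = list(Month = as.integer),  # store the month as an integer
  values_to = "Cases"                          # cell values go here
)

tidy
```

Printing `tidy` should give a 12-row tibble with the columns Country, Month and Cases, matching the second representation above.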

Later we’ll look at how we can convert messy datasets to tidy ones, and back again in the rare cases where you need a messy format.

The benefit of having your data in a tidy format is fairly simple: it provides a standard way of structuring a dataset. Tools can then be developed on the assumption that the data will arrive in that structure, which makes those tools easier to build and easier to combine. This is essentially what underpins the entire tidyverse: get your data into a tidy format and you’ll have access to everything the tidyverse provides.
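To illustrate the point about tools assuming a tidy structure, here is a minimal sketch that totals the cases per country. It assumes the tibble and dplyr packages are installed; the names `tidy_cases` and `totals` are illustrative only.

```r
library(tibble)
library(dplyr)

# The tidy version of the example data: one row per country per month
tidy_cases <- tibble(
  Country = rep(c("UK", "US", "France"), each = 4),
  Month   = rep(1:4, times = 3),
  Cases   = c(10, 15, 20, 25, 1, 2, 3, 4, 55, 110, 220, 440)
)

# Because Cases is a single column, one grouped summary covers every month
totals <- tidy_cases %>%
  group_by(Country) %>%
  summarise(Total = sum(Cases))

totals
```

With the messy layout, the same calculation would need to refer to Month1 through Month4 by name, and would have to be rewritten whenever a new month was added.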

3.1 Questions

If we also had the number of cancer cases in our example dataset, what would be the ‘tidy’ way of adding that data? Would it be A (the first output below) or B (the second output)? Why?

## # A tibble: 1 × 4
##   Country Month Emphysema Cancer
##   <chr>   <dbl>     <dbl>  <dbl>
## 1 UK          1        10    100

## # A tibble: 2 × 4
##   Country Month Disease   Cases
##   <chr>   <dbl> <chr>     <dbl>
## 1 UK          1 Emphysema    10
## 2 UK          1 Cancer      100