9 Tidying

Now that you know your dataset is clean, you’ll want to get it into a format that is amenable to your analysis. This is the tidying stage. To better understand what “tidy” data is and why we’re trying to get our data into this format, make sure you’ve read the tidyverse chapter.

9.1 Pivoting

The main culprit of untidy data is data that breaks the first rule: Not every column is a variable. Data in this format is sometimes called wide data - as you add more levels, you’ll add more columns and so the data will get wider. Conversely, tidy data will get longer as you add more levels.

So there two different forms of data:

  • Long
  • Wide

Let’s look at use our Kaggle dataset as an example to see how the two forms differ in practice:

## # A tibble: 12,441 × 7
##    Year Name              NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
##   <dbl> <chr>                <dbl>    <dbl>    <dbl>       <dbl>        <dbl>
## 1  2006 Wii Sports            41.5    29.0      3.77        8.46         82.7
## 2  1985 Super Mario Bros.     29.1     3.58     6.81        0.77         40.2
## 3  2008 Mario Kart Wii        15.8    12.9      3.79        3.31         35.8
## # … with 12,438 more rows

What would happen if we added data for more countries? We’d be adding more columns, and so our data would get wider. So our clean_vg_sales data is currently in the wide format. If we had the country as its own column however, like this:

## # A tibble: 62,205 × 3
##    Year Country Sales
##   <dbl> <chr>   <dbl>
## 1  2006 NA      41.5 
## 2  2006 EU      29.0 
## 3  2006 JP       3.77
## 4  2006 Other    8.46
## 5  2006 Global  82.7 
## # … with 62,200 more rows

Now we’ve got the country as a variable, the more countries we add the longer the data is going to get. So we’ve converted our data from the wide format to the long format. To do this, we’ve pivoted the Sales columns. Let’s look at the two types of pivot.

Note: One thing to keep in mind is that long data does not always mean tidy, but wide data can never be tidy. It’s a subtle distinction but it’s important to remember.

9.1.1 Pivoting columns to rows (longer)

To convert wide data to long, we need to pivot the columns to rows. To this using {tidyr}, we use the pivot_longer() function because we want to pivot from the wide format to the long format.

To do this in the simplest way, we just need to tell the function which columns we want to pivot:

tidyr::pivot_longer(clean_vg_sales, cols = c(NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales)) %>%
  # We'll just look at the columns we're interested in for the moment
  dplyr::select(Name, Year, name, value) %>%
  print(n = 2)
## # A tibble: 62,205 × 4
##   Name        Year name     value
##   <chr>      <dbl> <chr>    <dbl>
## 1 Wii Sports  2006 NA_Sales  41.5
## 2 Wii Sports  2006 EU_Sales  29.0
## # … with 62,203 more rows

This is a good start. Now we’ve converted to long format, we’re abiding by the three rules and so we’ve got a tidy dataset! But there’s definitely some improvements to be done. Firstly, “name” and “value” aren’t the best names we could come up with for these columns, so we should probably use some new ones. To do this, we just need to provide new names to the names_to and values_to parameters:

tidyr::pivot_longer(clean_vg_sales,
                    cols = c(NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales),
                    names_to = "Country",
                    values_to = "Sales") %>% 
  dplyr::select(Name, Year, Country, Sales) %>%
  print(n = 2)
## # A tibble: 62,205 × 4
##   Name        Year Country  Sales
##   <chr>      <dbl> <chr>    <dbl>
## 1 Wii Sports  2006 NA_Sales  41.5
## 2 Wii Sports  2006 EU_Sales  29.0
## # … with 62,203 more rows

Secondly, our Country column has the country code suffixed with ’_Sales’ (e.g. “NA_Sales” instead of just “NA”). If the text we wanted to remove was before the country name, we could utilise the names_prefix parameter to remove matching text from the start of each variable name:

tidyr::pivot_longer(clean_vg_sales,
                    cols = c(NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales),
                    names_to = "Country",
                    values_to = "Sales",
                    names_prefix = "(_Sales)") # Remove text matching "_Sales" exactly

Because it’s at the end of the string though, we have to deal with it a bit differently. One way to remove that text would be to split the column name by the "_" and then just exclude the second part:

tidy_vg_sales <- clean_vg_sales %>% 
  tidyr::pivot_longer(cols = c(NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales),
                      names_to = c("Country", NA), values_to = "Sales", names_sep = "_"
  )

tidy_vg_sales %>%
  dplyr::select(Year, Country, Sales) %>%
  print(n = 5)
## # A tibble: 62,205 × 3
##    Year Country Sales
##   <dbl> <chr>   <dbl>
## 1  2006 NA      41.5 
## 2  2006 EU      29.0 
## 3  2006 JP       3.77
## 4  2006 Other    8.46
## 5  2006 Global  82.7 
## # … with 62,200 more rows

Here, we’ve specified that we want to convert our column names to two separate ‘names’ columns, but that we want to exclude the second column (that’s what the NA in the c("Country", NA) vector represents). Then, we’ve specified with the names_sep parameter that we want to split the columns by the _ character.

But what happens now if we get sent the same dataset but with more countries (e.g. a SA_Sales column)? We’d have to add those extra columns to our cols vector manually. Instead, we can use some tidyselect syntax to choose columns based on their features, like how they end:

clean_vg_sales %>% 
  tidyr::pivot_longer(cols = tidyselect::ends_with("_Sales"),
                      names_to = c("Country", NA), values_to = "Sales", names_sep = "_"
  ) 

Now our code will work with any number of columns as long as they end in ’_Sales’.

This is just one example of how we could utilise the pivot_longer() function to convert ‘messy’ data to the tidy format. In the wild you’ll get data that violates any of the three rules of tidy data in many different ways, and we just don’t have the space to go through it here. If you do need how to convert some of the other forms of messy data, then the {tidyr} package has a number of vignettes outlining many different ways to use the tidying tools it provides.

9.1.2 Pivoting rows to columns (wider)

Most of the time you should be going from wide to long, but we’ll go through how to do the reverse for the occasions where it’s required. To transform the data to the wide format, we use the pivot_wider() function. At the simplest level, we just need to provide where the name comes from and where the values come from with the names_from and values_from parameters respectively:

tidyr::pivot_wider(tidy_vg_sales, names_from = Country, values_from = Sales)
## # A tibble: 12,441 × 11
##     Rank Name     Platform  Year Genre  Publisher  `NA`    EU    JP Other Global
##    <dbl> <chr>    <chr>    <dbl> <chr>  <chr>     <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1     1 Wii Spo… Wii       2006 Sports Nintendo   41.5 29.0   3.77  8.46   82.7
##  2     2 Super M… NES       1985 Platf… Nintendo   29.1  3.58  6.81  0.77   40.2
##  3     3 Mario K… Wii       2008 Racing Nintendo   15.8 12.9   3.79  3.31   35.8
##  4     4 Wii Spo… Wii       2009 Sports Nintendo   15.8 11.0   3.28  2.96   33  
##  5     5 Pokemon… GB        1996 Role-… Nintendo   11.3  8.89 10.2   1      31.4
##  6     6 Tetris   GB        1989 Puzzle Nintendo   23.2  2.26  4.22  0.58   30.3
##  7     7 New Sup… DS        2006 Platf… Nintendo   11.4  9.23  6.5   2.9    30.0
##  8     8 Wii Play Wii       2006 Misc   Nintendo   14.0  9.2   2.93  2.85   29.0
##  9     9 New Sup… Wii       2009 Platf… Nintendo   14.6  7.06  4.7   2.26   28.6
## 10    10 Duck Hu… NES       1984 Shoot… Nintendo   26.9  0.63  0.28  0.47   28.3
## # … with 12,431 more rows

To get our data more like we started, we can use the names_glue to provide a {glue} string argument to create the column names from a template:

tidyr::pivot_wider(tidy_vg_sales, names_from = Country, values_from = Sales,
                   names_glue = "{Country}_Sales") %>%
  dplyr::select(Year, Name, tidyselect::ends_with("_Sales")) %>%
  print(n = 5)
## # A tibble: 12,441 × 7
##    Year Name                 NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
##   <dbl> <chr>                   <dbl>    <dbl>    <dbl>       <dbl>        <dbl>
## 1  2006 Wii Sports               41.5    29.0      3.77        8.46         82.7
## 2  1985 Super Mario Bros.        29.1     3.58     6.81        0.77         40.2
## 3  2008 Mario Kart Wii           15.8    12.9      3.79        3.31         35.8
## 4  2009 Wii Sports Resort        15.8    11.0      3.28        2.96         33  
## 5  1996 Pokemon Red/Pokemon…     11.3     8.89    10.2         1            31.4
## # … with 12,436 more rows

9.2 Separating & Uniting

9.2.1 Separating

A less common violation of the tidy data principles occurs when a single column contains more than one variable. Let’s look at an example:

## # A tibble: 4 × 2
##   Species    LeafArea
##   <chr>      <chr>   
## 1 Eucalyptus 20x100  
## 2 Elm        10x5    
## 3 Agave      2x10    
## 4 Persea     15x11

Here we’ve got a dataset with different plant species and the area of their leaves. But we’ve got two variables combined into a single column; we’ve got both the width and the length of the leaf included in a single ‘area’ variable. To convert this dataset to the tidy format, we need to split out those values into separate ‘width’ and ‘length’ variables.

We can use the separate() function from the {tidyr} package to do this. We just need to provide the column we want to split via the col parameter, the column names that we want to split the original column into with the into parameter, and the separator that’s separating the two or more values within the column via the sep parameter:

separate_example <- tidyr::separate(combined_example, col = LeafArea, into = c("Width", "Length"), sep = "x")
print(separate_example)
## # A tibble: 4 × 3
##   Species    Width Length
##   <chr>      <chr> <chr> 
## 1 Eucalyptus 20    100   
## 2 Elm        10    5     
## 3 Agave      2     10    
## 4 Persea     15    11

9.2.2 Uniting

Just like pivoting wider was essentially the inverse of pivoting longer, uniting columns is the inverse of separating them. Again, our focus in this chapter is to tidy up our data, and so we’ll likely always be separating columns rather than uniting them. Regardless, it’s good to understand how to unite columns for the occasions that it is required.

To join multiple columns together, we use the unite() function from the {tidyr} package. We just need to provide the name of the new column we want to create via the col parameter, the column names of the columns to unite as unnamed parameters, and then the separating character we want to use via sep:

tidyr::unite(separate_example, col = "LeafArea", Width, Length, sep = "x")
## # A tibble: 4 × 2
##   Species    LeafArea
##   <chr>      <chr>   
## 1 Eucalyptus 20x100  
## 2 Elm        10x5    
## 3 Agave      2x10    
## 4 Persea     15x11