9 Tidying
Now that you know your dataset is clean, you’ll want to get it into a format that is amenable to your analysis. This is the tidying stage. To better understand what “tidy” data is and why we’re trying to get our data into this format, make sure you’ve read the tidyverse chapter.
9.1 Pivoting
The main culprit of untidy data is data that breaks the first rule: Not every column is a variable. Data in this format is sometimes called wide data - as you add more levels, you’ll add more columns and so the data will get wider. Conversely, tidy data will get longer as you add more levels.
So there two different forms of data:
- Long
- Wide
Let’s look at use our Kaggle dataset as an example to see how the two forms differ in practice:
## # A tibble: 12,441 × 7
## Year Name NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2006 Wii Sports 41.5 29.0 3.77 8.46 82.7
## 2 1985 Super Mario Bros. 29.1 3.58 6.81 0.77 40.2
## 3 2008 Mario Kart Wii 15.8 12.9 3.79 3.31 35.8
## # … with 12,438 more rows
What would happen if we added data for more countries? We’d be adding more columns, and so our data would get wider. So our clean_vg_sales
data is currently in the wide format. If we had the country as its own column however, like this:
## # A tibble: 62,205 × 3
## Year Country Sales
## <dbl> <chr> <dbl>
## 1 2006 NA 41.5
## 2 2006 EU 29.0
## 3 2006 JP 3.77
## 4 2006 Other 8.46
## 5 2006 Global 82.7
## # … with 62,200 more rows
Now we’ve got the country as a variable, the more countries we add the longer the data is going to get. So we’ve converted our data from the wide format to the long format. To do this, we’ve pivoted the Sales columns. Let’s look at the two types of pivot.
Note: One thing to keep in mind is that long data does not always mean tidy, but wide data can never be tidy. It’s a subtle distinction but it’s important to remember.
9.1.1 Pivoting columns to rows (longer)
To convert wide data to long, we need to pivot the columns to rows. To this using {tidyr}
, we use the pivot_longer()
function because we want to pivot from the wide format to the long format.
To do this in the simplest way, we just need to tell the function which columns we want to pivot:
::pivot_longer(clean_vg_sales, cols = c(NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales)) %>%
tidyr # We'll just look at the columns we're interested in for the moment
dplyr::select(Name, Year, name, value) %>%
print(n = 2)
## # A tibble: 62,205 × 4
## Name Year name value
## <chr> <dbl> <chr> <dbl>
## 1 Wii Sports 2006 NA_Sales 41.5
## 2 Wii Sports 2006 EU_Sales 29.0
## # … with 62,203 more rows
This is a good start. Now we’ve converted to long format, we’re abiding by the three rules and so we’ve got a tidy dataset! But there’s definitely some improvements to be done. Firstly, “name” and “value” aren’t the best names we could come up with for these columns, so we should probably use some new ones. To do this, we just need to provide new names to the names_to
and values_to
parameters:
::pivot_longer(clean_vg_sales,
tidyrcols = c(NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales),
names_to = "Country",
values_to = "Sales") %>%
dplyr::select(Name, Year, Country, Sales) %>%
print(n = 2)
## # A tibble: 62,205 × 4
## Name Year Country Sales
## <chr> <dbl> <chr> <dbl>
## 1 Wii Sports 2006 NA_Sales 41.5
## 2 Wii Sports 2006 EU_Sales 29.0
## # … with 62,203 more rows
Secondly, our Country column has the country code suffixed with ’_Sales’ (e.g. “NA_Sales” instead of just “NA”). If the text we wanted to remove was before the country name, we could utilise the names_prefix
parameter to remove matching text from the start of each variable name:
::pivot_longer(clean_vg_sales,
tidyrcols = c(NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales),
names_to = "Country",
values_to = "Sales",
names_prefix = "(_Sales)") # Remove text matching "_Sales" exactly
Because it’s at the end of the string though, we have to deal with it a bit differently. One way to remove that text would be to split the column name by the "_" and then just exclude the second part:
clean_vg_sales %>%
tidy_vg_sales <- tidyr::pivot_longer(cols = c(NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales),
names_to = c("Country", NA), values_to = "Sales", names_sep = "_"
)
%>%
tidy_vg_sales dplyr::select(Year, Country, Sales) %>%
print(n = 5)
## # A tibble: 62,205 × 3
## Year Country Sales
## <dbl> <chr> <dbl>
## 1 2006 NA 41.5
## 2 2006 EU 29.0
## 3 2006 JP 3.77
## 4 2006 Other 8.46
## 5 2006 Global 82.7
## # … with 62,200 more rows
Here, we’ve specified that we want to convert our column names to two separate ‘names’ columns, but that we want to exclude the second column (that’s what the NA
in the c("Country", NA)
vector represents). Then, we’ve specified with the names_sep
parameter that we want to split the columns by the _
character.
But what happens now if we get sent the same dataset but with more countries (e.g. a SA_Sales
column)? We’d have to add those extra columns to our cols
vector manually. Instead, we can use some tidyselect
syntax to choose columns based on their features, like how they end:
%>%
clean_vg_sales tidyr::pivot_longer(cols = tidyselect::ends_with("_Sales"),
names_to = c("Country", NA), values_to = "Sales", names_sep = "_"
)
Now our code will work with any number of columns as long as they end in ’_Sales’.
This is just one example of how we could utilise the pivot_longer()
function to convert ‘messy’ data to the tidy format. In the wild you’ll get data that violates any of the three rules of tidy data in many different ways, and we just don’t have the space to go through it here. If you do need how to convert some of the other forms of messy data, then the {tidyr}
package has a number of vignettes outlining many different ways to use the tidying tools it provides.
9.1.2 Pivoting rows to columns (wider)
Most of the time you should be going from wide to long, but we’ll go through how to do the reverse for the occasions where it’s required. To transform the data to the wide format, we use the pivot_wider()
function. At the simplest level, we just need to provide where the name comes from and where the values come from with the names_from
and values_from
parameters respectively:
::pivot_wider(tidy_vg_sales, names_from = Country, values_from = Sales) tidyr
## # A tibble: 12,441 × 11
## Rank Name Platform Year Genre Publisher `NA` EU JP Other Global
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Wii Spo… Wii 2006 Sports Nintendo 41.5 29.0 3.77 8.46 82.7
## 2 2 Super M… NES 1985 Platf… Nintendo 29.1 3.58 6.81 0.77 40.2
## 3 3 Mario K… Wii 2008 Racing Nintendo 15.8 12.9 3.79 3.31 35.8
## 4 4 Wii Spo… Wii 2009 Sports Nintendo 15.8 11.0 3.28 2.96 33
## 5 5 Pokemon… GB 1996 Role-… Nintendo 11.3 8.89 10.2 1 31.4
## 6 6 Tetris GB 1989 Puzzle Nintendo 23.2 2.26 4.22 0.58 30.3
## 7 7 New Sup… DS 2006 Platf… Nintendo 11.4 9.23 6.5 2.9 30.0
## 8 8 Wii Play Wii 2006 Misc Nintendo 14.0 9.2 2.93 2.85 29.0
## 9 9 New Sup… Wii 2009 Platf… Nintendo 14.6 7.06 4.7 2.26 28.6
## 10 10 Duck Hu… NES 1984 Shoot… Nintendo 26.9 0.63 0.28 0.47 28.3
## # … with 12,431 more rows
To get our data more like we started, we can use the names_glue
to provide a {glue}
string argument to create the column names from a template:
::pivot_wider(tidy_vg_sales, names_from = Country, values_from = Sales,
tidyrnames_glue = "{Country}_Sales") %>%
dplyr::select(Year, Name, tidyselect::ends_with("_Sales")) %>%
print(n = 5)
## # A tibble: 12,441 × 7
## Year Name NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2006 Wii Sports 41.5 29.0 3.77 8.46 82.7
## 2 1985 Super Mario Bros. 29.1 3.58 6.81 0.77 40.2
## 3 2008 Mario Kart Wii 15.8 12.9 3.79 3.31 35.8
## 4 2009 Wii Sports Resort 15.8 11.0 3.28 2.96 33
## 5 1996 Pokemon Red/Pokemon… 11.3 8.89 10.2 1 31.4
## # … with 12,436 more rows
9.2 Separating & Uniting
9.2.1 Separating
A less common violation of the tidy data principles occurs when a single column contains more than one variable. Let’s look at an example:
## # A tibble: 4 × 2
## Species LeafArea
## <chr> <chr>
## 1 Eucalyptus 20x100
## 2 Elm 10x5
## 3 Agave 2x10
## 4 Persea 15x11
Here we’ve got a dataset with different plant species and the area of their leaves. But we’ve got two variables combined into a single column; we’ve got both the width and the length of the leaf included in a single ‘area’ variable. To convert this dataset to the tidy format, we need to split out those values into separate ‘width’ and ‘length’ variables.
We can use the separate()
function from the {tidyr}
package to do this. We just need to provide the column we want to split via the col
parameter, the column names that we want to split the original column into with the into
parameter, and the separator that’s separating the two or more values within the column via the sep
parameter:
tidyr::separate(combined_example, col = LeafArea, into = c("Width", "Length"), sep = "x")
separate_example <-print(separate_example)
## # A tibble: 4 × 3
## Species Width Length
## <chr> <chr> <chr>
## 1 Eucalyptus 20 100
## 2 Elm 10 5
## 3 Agave 2 10
## 4 Persea 15 11
9.2.2 Uniting
Just like pivoting wider was essentially the inverse of pivoting longer, uniting columns is the inverse of separating them. Again, our focus in this chapter is to tidy up our data, and so we’ll likely always be separating columns rather than uniting them. Regardless, it’s good to understand how to unite columns for the occasions that it is required.
To join multiple columns together, we use the unite()
function from the {tidyr}
package. We just need to provide the name of the new column we want to create via the col
parameter, the column names of the columns to unite as unnamed parameters, and then the separating character we want to use via sep
:
::unite(separate_example, col = "LeafArea", Width, Length, sep = "x") tidyr
## # A tibble: 4 × 2
## Species LeafArea
## <chr> <chr>
## 1 Eucalyptus 20x100
## 2 Elm 10x5
## 3 Agave 2x10
## 4 Persea 15x11