4.3 Cleaning and tidying data

Loading data is often just the first step in your project. Most of the time, you’ll have messy datasets with odd columns and missing data points that you’ll need to deal with before you can actually do any meaningful analysis: you’ll need to clean your data.

You’ll also need to get it into a format that is amenable to your analysis. This is the tidying stage, and because cleaning and tidying data are so tightly related, we’ll do these steps at the same time.

4.3.1 Cleaning

Here are some of the more common operations that you’ll be doing when it comes to data cleaning:

  • Removing/replacing missing values
  • Changing column types
  • Combining columns
  • Renaming columns
  • Checking for anomalies

Let’s look at how we might do these tasks in R using the datasets::airquality dataset as an example.

head(datasets::airquality, 10)
##    Ozone Solar.R Wind Temp Month Day
## 1     41     190  7.4   67     5   1
## 2     36     118  8.0   72     5   2
## 3     12     149 12.6   74     5   3
## 4     18     313 11.5   62     5   4
## 5     NA      NA 14.3   56     5   5
## 6     28      NA 14.9   66     5   6
## 7     23     299  8.6   65     5   7
## 8     19      99 13.8   59     5   8
## 9      8      19 20.1   61     5   9
## 10    NA     194  8.6   69     5  10

4.3.1.1 Missing values

Missing values are common in data science - data collection is often imperfect and so you’ll end up with observations or data-points missing. Firstly, you need to decide what you’re going to do with those.

The easiest approach to just to remove them, and we can do that with the dplyr::filter() function:

## To remove rows with NA in one column
datasets::airquality %>% 
  dplyr::filter(!is.na(Ozone)) %>%
  head(5)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    28      NA 14.9   66     5   6
## To remove rows with NA in any column
datasets::airquality %>% 
  dplyr::filter(dplyr::across(dplyr::everything(), ~!is.na(.x))) %>%
  head(5)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    23     299  8.6   65     5   7

Another approach to replace them with either the average for that column or with the closest neighbour value (this works better with time series data).

To replace with the nearest value, we can use the tidyr::fill() function:

## Fill a single column
datasets::airquality %>%
  tidyr::fill(Ozone, .direction = "downup") %>%
  head(5)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    18      NA 14.3   56     5   5
datasets::airquality %>%
  tidyr::fill(dplyr::everything(), .direction = "downup") %>%
  head(5)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    18     313 14.3   56     5   5

To replace them with the mean, we can either use a package like zoo, or we can use the tidyverse packages and our own function:

replace_with_mean <- function(x) {
  replace(x, is.na(x), mean(x, na.rm = TRUE))
}

datasets::airquality %>%
  dplyr::mutate(dplyr::across(dplyr::everything(), replace_with_mean)) %>%
  head(5)
##      Ozone  Solar.R Wind Temp Month Day
## 1 41.00000 190.0000  7.4   67     5   1
## 2 36.00000 118.0000  8.0   72     5   2
## 3 12.00000 149.0000 12.6   74     5   3
## 4 18.00000 313.0000 11.5   62     5   4
## 5 42.12931 185.9315 14.3   56     5   5

A word of warning here, however. If you’re doing complex modelling or very sensitive analyses, filling values like this can be misleading at best. Always think about the best approach for your specific project and what the repercussions of filling empty values might be.

4.3.1.2 Changing column types

When you import data into R, sometimes the type of the column doesn’t match what you want it to be. One way of tackling this is to define your column types when you import the data (as we looked at before), but it’s also perfectly acceptable to change the column type after the import.

Probably the most common conversion is from a character string to a date. For this example, we’re just going to use some test data:

bad_tibble <- tibble::tribble(~bad_date, ~value,
                              "2012/01/01", 100,
                              "2014/06/01", 200)

bad_tibble %>%
  dplyr::mutate(good_date = as.Date(bad_date, format = "%Y/%m/%d"))
## # A tibble: 2 x 3
##   bad_date   value good_date 
##   <chr>      <dbl> <date>    
## 1 2012/01/01   100 2012-01-01
## 2 2014/06/01   200 2014-06-01

Essentially, all you need to do is wrap your conversion function (e.g. as.Date(), as.character()) in a dplyr::mutate() call and you should be able to change your columns to whatever you need.

4.3.1.3 Combining columns