8 Cleaning

Once you’ve got your data into R, you’ll likely need to clean it up a bit. Here are some of the more common operations that you’ll be doing when it comes to data cleaning:

  • Filtering
  • Removing/replacing missing values
  • Changing column types

Let’s look at how we might do these tasks in R using our Kaggle dataset as an example.

8.1 Filtering

In some cases, you’ll only want a subset of the data that’s been provided to you. To filter the dataset down, use the dplyr::filter() function and provide your criteria:

vg_sales %>%
  dplyr::filter(Year > 200) %>%
  print(n = 5)
## # A tibble: 16,327 x 11
##    Rank Name  Platform  Year Genre Publisher NA_Sales EU_Sales JP_Sales
##   <dbl> <chr> <chr>    <dbl> <chr> <chr>        <dbl>    <dbl>    <dbl>
## 1     1 Wii … Wii       2006 Spor… Nintendo      41.5    29.0      3.77
## 2     2 Supe… NES       1985 Plat… Nintendo      29.1     3.58     6.81
## 3     3 Mari… Wii       2008 Raci… Nintendo      15.8    12.9      3.79
## 4     4 Wii … Wii       2009 Spor… Nintendo      15.8    11.0      3.28
## 5     5 Poke… GB        1996 Role… Nintendo      11.3     8.89    10.2 
## # … with 16,322 more rows, and 2 more variables: Other_Sales <dbl>,
## #   Global_Sales <dbl>

You can utilise the & and | operators (and and or respectively) to create more complicated criteria. You can also separate your expressions with commas, which is equivalent to &:

vg_sales %>%
  dplyr::filter(Year == 1999 | Year == 2000) %>%
  print(n = 2)
## # A tibble: 687 x 11
##    Rank Name  Platform  Year Genre Publisher NA_Sales EU_Sales JP_Sales
##   <dbl> <chr> <chr>    <dbl> <chr> <chr>        <dbl>    <dbl>    <dbl>
## 1    13 Poke… GB        1999 Role… Nintendo      9        6.18     7.2 
## 2    70 Gran… PS        1999 Raci… Sony Com…     3.88     3.42     1.69
## # … with 685 more rows, and 2 more variables: Other_Sales <dbl>,
## #   Global_Sales <dbl>
vg_sales %>%
  dplyr::filter(Year == 1999, Platform == "GB") %>%
  # equivalent to dplyr::filter(Year == 1999 & Platform == "GB")
  print(n = 2)
## # A tibble: 11 x 11
##    Rank Name  Platform  Year Genre Publisher NA_Sales EU_Sales JP_Sales
##   <dbl> <chr> <chr>    <dbl> <chr> <chr>        <dbl>    <dbl>    <dbl>
## 1    13 Poke… GB        1999 Role… Nintendo      9        6.18     7.2 
## 2   172 Poke… GB        1999 Misc  Nintendo      3.02     1.12     1.01
## # … with 9 more rows, and 2 more variables: Other_Sales <dbl>,
## #   Global_Sales <dbl>

Looking at our dataset, it looks as though the data past 2010 is a bit more spotty. Let’s filtered out the years beyong 2010 for our analysis:

clean_vg_sales <- vg_sales %>%
  dplyr::filter(Year <= 2010)

8.2 Missing values

Missing values are common in data science - data collection is often imperfect and so you’ll end up with observations or data-points missing. Firstly, you need to decide what you’re going to do with those.

The easiest approach to just to remove them, and we can do that with the dplyr::filter() and the is.na() functions. If you remember in the Loading chapter, {readr} had trouble parsing some of the values in our Year column to numeric, and so gave us some NAs. Let’s remove those rows using this approach:

## To remove rows with NA in one column
clean_vg_sales <- clean_vg_sales %>% 
  dplyr::filter(!is.na(Year))
print(clean_vg_sales, n = 5)
## # A tibble: 12,441 x 11
##    Rank Name  Platform  Year Genre Publisher NA_Sales EU_Sales JP_Sales
##   <dbl> <chr> <chr>    <dbl> <chr> <chr>        <dbl>    <dbl>    <dbl>
## 1     1 Wii … Wii       2006 Spor… Nintendo      41.5    29.0      3.77
## 2     2 Supe… NES       1985 Plat… Nintendo      29.1     3.58     6.81
## 3     3 Mari… Wii       2008 Raci… Nintendo      15.8    12.9      3.79
## 4     4 Wii … Wii       2009 Spor… Nintendo      15.8    11.0      3.28
## 5     5 Poke… GB        1996 Role… Nintendo      11.3     8.89    10.2 
## # … with 12,436 more rows, and 2 more variables: Other_Sales <dbl>,
## #   Global_Sales <dbl>
## To remove rows with NA in any column
vg_sales %>% 
  dplyr::filter(dplyr::across(dplyr::everything(), ~!is.na(.x))) %>%
  print(n = 5)
## # A tibble: 16,327 x 11
##    Rank Name  Platform  Year Genre Publisher NA_Sales EU_Sales JP_Sales
##   <dbl> <chr> <chr>    <dbl> <chr> <chr>        <dbl>    <dbl>    <dbl>
## 1     1 Wii … Wii       2006 Spor… Nintendo      41.5    29.0      3.77
## 2     2 Supe… NES       1985 Plat… Nintendo      29.1     3.58     6.81
## 3     3 Mari… Wii       2008 Raci… Nintendo      15.8    12.9      3.79
## 4     4 Wii … Wii       2009 Spor… Nintendo      15.8    11.0      3.28
## 5     5 Poke… GB        1996 Role… Nintendo      11.3     8.89    10.2 
## # … with 16,322 more rows, and 2 more variables: Other_Sales <dbl>,
## #   Global_Sales <dbl>

Another approach to replace them with either the average for that column or with the closest neighbour value (this works better with time series data).

To replace with the nearest value, we can use the tidyr::fill() function. That approach wouldn’t be appropriate for our video games example so let’s use the datasets::airquality dataset instead:

head(datasets::airquality, 5)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## Fill a single column
datasets::airquality %>%
  tidyr::fill(Ozone, .direction = "downup") %>%
  head(5)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    18      NA 14.3   56     5   5
## Fill all columns
datasets::airquality %>%
  tidyr::fill(dplyr::everything(), .direction = "downup") %>%
  head(5)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    18     313 14.3   56     5   5

To replace them with the mean, we can either use a package like zoo, or we can use the tidyverse packages and our own function:

replace_with_mean <- function(x) {
  replace(x, is.na(x), mean(x, na.rm = TRUE))
}

datasets::airquality %>%
  # The mutate function creates new columns or overwrites existing ones
  dplyr::mutate(dplyr::across(dplyr::everything(), replace_with_mean)) %>%
  head(5)
##      Ozone  Solar.R Wind Temp Month Day
## 1 41.00000 190.0000  7.4   67     5   1
## 2 36.00000 118.0000  8.0   72     5   2
## 3 12.00000 149.0000 12.6   74     5   3
## 4 18.00000 313.0000 11.5   62     5   4
## 5 42.12931 185.9315 14.3   56     5   5

A word of warning here, however. If you’re doing complex modelling or very sensitive analyses, filling values like this can be misleading at best. Always think about the best approach for your specific project and what the repercussions of filling empty values might be.

8.3 Changing column types

When you import data into R, sometimes the type of the column doesn’t match what you want it to be. One way of tackling this is to define your column types when you import the data (as we looked at before), but it’s also perfectly acceptable to change the column type after the import.

Probably the most common conversion is from a character string to a date. For this example, we’re just going to use some test data:

bad_tibble <- tibble::tribble(~bad_date, ~value,
                              "2012/01/01", 100,
                              "2014/06/01", 200)

bad_tibble %>%
  dplyr::mutate(good_date = as.Date(bad_date, format = "%Y/%m/%d"))
## # A tibble: 2 x 3
##   bad_date   value good_date 
##   <chr>      <dbl> <date>    
## 1 2012/01/01   100 2012-01-01
## 2 2014/06/01   200 2014-06-01

Essentially, all you need to do is wrap your conversion function (e.g. as.Date(), as.character()) in a dplyr::mutate() call and you should be able to change your columns to whatever you need. To replace the original column, make sure you give it the same name when using dplyr::mutate():

bad_tibble %>%
  # This will replace the bad_date column rather than creating a new column
  dplyr::mutate(bad_date = as.Date(bad_date, format = "%Y/%m/%d"))
## # A tibble: 2 x 2
##   bad_date   value
##   <date>     <dbl>
## 1 2012-01-01   100
## 2 2014-06-01   200