8 Cleaning

Once you’ve got your data into R, you’ll likely need to clean it up a bit. Here are some of the more common operations that you’ll be doing when it comes to data cleaning:

Filtering
Removing/replacing missing values
Changing column types

Let’s look at how we might do these tasks in R using our Kaggle dataset as an example.

8.1 Filtering

In some cases, you’ll only want a subset of the data that’s been provided to you. To filter the dataset down, use the dplyr::filter() function and provide your criteria:

vg_sales %>%
  dplyr::filter(Year > 200) %>%
  print(n = 5)

## # A tibble: 16,327 × 11
##    Rank Name         Platform  Year Genre   Publisher NA_Sales EU_Sales JP_Sales
##   <dbl> <chr>        <chr>    <dbl> <chr>   <chr>        <dbl>    <dbl>    <dbl>
## 1     1 Wii Sports   Wii       2006 Sports  Nintendo      41.5    29.0      3.77
## 2     2 Super Mario… NES       1985 Platfo… Nintendo      29.1     3.58     6.81
## 3     3 Mario Kart … Wii       2008 Racing  Nintendo      15.8    12.9      3.79
## 4     4 Wii Sports … Wii       2009 Sports  Nintendo      15.8    11.0      3.28
## 5     5 Pokemon Red… GB        1996 Role-P… Nintendo      11.3     8.89    10.2 
## # … with 16,322 more rows, and 2 more variables: Other_Sales <dbl>,
## #   Global_Sales <dbl>

You can utilise the & and | operators (and and or respectively) to create more complicated criteria. You can also separate your expressions with commas, which is equivalent to &:

vg_sales %>%
  dplyr::filter(Year == 1999 | Year == 2000) %>%
  print(n = 2)

## # A tibble: 687 × 11
##    Rank Name       Platform  Year Genre  Publisher    NA_Sales EU_Sales JP_Sales
##   <dbl> <chr>      <chr>    <dbl> <chr>  <chr>           <dbl>    <dbl>    <dbl>
## 1    13 Pokemon G… GB        1999 Role-… Nintendo         9        6.18     7.2 
## 2    70 Gran Turi… PS        1999 Racing Sony Comput…     3.88     3.42     1.69
## # … with 685 more rows, and 2 more variables: Other_Sales <dbl>,
## #   Global_Sales <dbl>

vg_sales %>%
  dplyr::filter(Year == 1999, Platform == "GB") %>%
  # equivalent to dplyr::filter(Year == 1999 & Platform == "GB")
  print(n = 2)

## # A tibble: 11 × 11
##    Rank Name         Platform  Year Genre   Publisher NA_Sales EU_Sales JP_Sales
##   <dbl> <chr>        <chr>    <dbl> <chr>   <chr>        <dbl>    <dbl>    <dbl>
## 1    13 Pokemon Gol… GB        1999 Role-P… Nintendo      9        6.18     7.2 
## 2   172 Pokemon Pin… GB        1999 Misc    Nintendo      3.02     1.12     1.01
## # … with 9 more rows, and 2 more variables: Other_Sales <dbl>,
## #   Global_Sales <dbl>

Looking at our dataset, it looks as though the data past 2010 is a bit more spotty. Let’s filtered out the years beyong 2010 for our analysis:

clean_vg_sales <- vg_sales %>%
  dplyr::filter(Year <= 2010)

8.2 Missing values

Missing values are common in data science - data collection is often imperfect and so you’ll end up with observations or data-points missing. Firstly, you need to decide what you’re going to do with those.

The easiest approach to just to remove them, and we can do that with the dplyr::filter() and the is.na() functions. If you remember in the Loading chapter, {readr} had trouble parsing some of the values in our Year column to numeric, and so gave us some NAs. Let’s remove those rows using this approach:

## To remove rows with NA in one column
clean_vg_sales <- clean_vg_sales %>% 
  dplyr::filter(!is.na(Year))
print(clean_vg_sales, n = 5)

## # A tibble: 12,441 × 11
##    Rank Name         Platform  Year Genre   Publisher NA_Sales EU_Sales JP_Sales
##   <dbl> <chr>        <chr>    <dbl> <chr>   <chr>        <dbl>    <dbl>    <dbl>
## 1     1 Wii Sports   Wii       2006 Sports  Nintendo      41.5    29.0      3.77
## 2     2 Super Mario… NES       1985 Platfo… Nintendo      29.1     3.58     6.81
## 3     3 Mario Kart … Wii       2008 Racing  Nintendo      15.8    12.9      3.79
## 4     4 Wii Sports … Wii       2009 Sports  Nintendo      15.8    11.0      3.28
## 5     5 Pokemon Red… GB        1996 Role-P… Nintendo      11.3     8.89    10.2 
## # … with 12,436 more rows, and 2 more variables: Other_Sales <dbl>,
## #   Global_Sales <dbl>

## To remove rows with NA in any column
vg_sales %>% 
  dplyr::filter(dplyr::across(dplyr::everything(), ~!is.na(.x))) %>%
  print(n = 5)

## Warning: data frame results in `filter()` are deprecated, use `if_any()` or
## `if_all()`.

## # A tibble: 16,327 × 11
##    Rank Name         Platform  Year Genre   Publisher NA_Sales EU_Sales JP_Sales
##   <dbl> <chr>        <chr>    <dbl> <chr>   <chr>        <dbl>    <dbl>    <dbl>
## 1     1 Wii Sports   Wii       2006 Sports  Nintendo      41.5    29.0      3.77
## 2     2 Super Mario… NES       1985 Platfo… Nintendo      29.1     3.58     6.81
## 3     3 Mario Kart … Wii       2008 Racing  Nintendo      15.8    12.9      3.79
## 4     4 Wii Sports … Wii       2009 Sports  Nintendo      15.8    11.0      3.28
## 5     5 Pokemon Red… GB        1996 Role-P… Nintendo      11.3     8.89    10.2 
## # … with 16,322 more rows, and 2 more variables: Other_Sales <dbl>,
## #   Global_Sales <dbl>

Another approach to replace them with either the average for that column or with the closest neighbour value (this works better with time series data).

To replace with the nearest value, we can use the tidyr::fill() function. That approach wouldn’t be appropriate for our video games example so let’s use the datasets::airquality dataset instead:

head(datasets::airquality, 5)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5

## Fill a single column
datasets::airquality %>%
  tidyr::fill(Ozone, .direction = "downup") %>%
  head(5)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    18      NA 14.3   56     5   5

## Fill all columns
datasets::airquality %>%
  tidyr::fill(dplyr::everything(), .direction = "downup") %>%
  head(5)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    18     313 14.3   56     5   5

To replace them with the mean, we can either use a package like zoo, or we can use the tidyverse packages and our own function:

replace_with_mean <- function(x) {
  replace(x, is.na(x), mean(x, na.rm = TRUE))
}

datasets::airquality %>%
  # The mutate function creates new columns or overwrites existing ones
  dplyr::mutate(dplyr::across(dplyr::everything(), replace_with_mean)) %>%
  head(5)

##      Ozone  Solar.R Wind Temp Month Day
## 1 41.00000 190.0000  7.4   67     5   1
## 2 36.00000 118.0000  8.0   72     5   2
## 3 12.00000 149.0000 12.6   74     5   3
## 4 18.00000 313.0000 11.5   62     5   4
## 5 42.12931 185.9315 14.3   56     5   5

A word of warning here, however. If you’re doing complex modelling or very sensitive analyses, filling values like this can be misleading at best. Always think about the best approach for your specific project and what the repercussions of filling empty values might be.

8.3 Changing column types

When you import data into R, sometimes the type of the column doesn’t match what you want it to be. One way of tackling this is to define your column types when you import the data (as we looked at before), but it’s also perfectly acceptable to change the column type after the import.

Probably the most common conversion is from a character string to a date. For this example, we’re just going to use some test data:

bad_tibble <- tibble::tribble(~bad_date, ~value,
                              "2012/01/01", 100,
                              "2014/06/01", 200)

bad_tibble %>%
  dplyr::mutate(good_date = as.Date(bad_date, format = "%Y/%m/%d"))

## # A tibble: 2 × 3
##   bad_date   value good_date 
##   <chr>      <dbl> <date>    
## 1 2012/01/01   100 2012-01-01
## 2 2014/06/01   200 2014-06-01

Essentially, all you need to do is wrap your conversion function (e.g. as.Date(), as.character()) in a dplyr::mutate() call and you should be able to change your columns to whatever you need. To replace the original column, make sure you give it the same name when using dplyr::mutate():

bad_tibble %>%
  # This will replace the bad_date column rather than creating a new column
  dplyr::mutate(bad_date = as.Date(bad_date, format = "%Y/%m/%d"))

## # A tibble: 2 × 2
##   bad_date   value
##   <date>     <dbl>
## 1 2012-01-01   100
## 2 2014-06-01   200