8 Cleaning
Once you’ve got your data into R, you’ll likely need to clean it up a bit. Here are some of the more common operations that you’ll be doing when it comes to data cleaning:
- Filtering
- Removing/replacing missing values
- Changing column types
Let’s look at how we might do these tasks in R using our Kaggle dataset as an example.
8.1 Filtering
In some cases, you’ll only want a subset of the data that’s been provided to you. To filter the dataset down, use the dplyr::filter()
function and provide your criteria:
%>%
vg_sales dplyr::filter(Year > 200) %>%
print(n = 5)
## # A tibble: 16,327 × 11
## Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 Wii Sports Wii 2006 Sports Nintendo 41.5 29.0 3.77
## 2 2 Super Mario… NES 1985 Platfo… Nintendo 29.1 3.58 6.81
## 3 3 Mario Kart … Wii 2008 Racing Nintendo 15.8 12.9 3.79
## 4 4 Wii Sports … Wii 2009 Sports Nintendo 15.8 11.0 3.28
## 5 5 Pokemon Red… GB 1996 Role-P… Nintendo 11.3 8.89 10.2
## # … with 16,322 more rows, and 2 more variables: Other_Sales <dbl>,
## # Global_Sales <dbl>
You can utilise the &
and |
operators (and
and or
respectively) to create more complicated criteria. You can also separate your expressions with commas, which is equivalent to &
:
%>%
vg_sales dplyr::filter(Year == 1999 | Year == 2000) %>%
print(n = 2)
## # A tibble: 687 × 11
## Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 13 Pokemon G… GB 1999 Role-… Nintendo 9 6.18 7.2
## 2 70 Gran Turi… PS 1999 Racing Sony Comput… 3.88 3.42 1.69
## # … with 685 more rows, and 2 more variables: Other_Sales <dbl>,
## # Global_Sales <dbl>
%>%
vg_sales dplyr::filter(Year == 1999, Platform == "GB") %>%
# equivalent to dplyr::filter(Year == 1999 & Platform == "GB")
print(n = 2)
## # A tibble: 11 × 11
## Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 13 Pokemon Gol… GB 1999 Role-P… Nintendo 9 6.18 7.2
## 2 172 Pokemon Pin… GB 1999 Misc Nintendo 3.02 1.12 1.01
## # … with 9 more rows, and 2 more variables: Other_Sales <dbl>,
## # Global_Sales <dbl>
Looking at our dataset, it looks as though the data past 2010 is a bit more spotty. Let’s filtered out the years beyong 2010 for our analysis:
vg_sales %>%
clean_vg_sales <- dplyr::filter(Year <= 2010)
8.2 Missing values
Missing values are common in data science - data collection is often imperfect and so you’ll end up with observations or data-points missing. Firstly, you need to decide what you’re going to do with those.
The easiest approach to just to remove them, and we can do that with the dplyr::filter()
and the is.na()
functions. If you remember in the Loading chapter, {readr}
had trouble parsing some of the values in our Year column to numeric, and so gave us some NAs. Let’s remove those rows using this approach:
## To remove rows with NA in one column
clean_vg_sales %>%
clean_vg_sales <- dplyr::filter(!is.na(Year))
print(clean_vg_sales, n = 5)
## # A tibble: 12,441 × 11
## Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 Wii Sports Wii 2006 Sports Nintendo 41.5 29.0 3.77
## 2 2 Super Mario… NES 1985 Platfo… Nintendo 29.1 3.58 6.81
## 3 3 Mario Kart … Wii 2008 Racing Nintendo 15.8 12.9 3.79
## 4 4 Wii Sports … Wii 2009 Sports Nintendo 15.8 11.0 3.28
## 5 5 Pokemon Red… GB 1996 Role-P… Nintendo 11.3 8.89 10.2
## # … with 12,436 more rows, and 2 more variables: Other_Sales <dbl>,
## # Global_Sales <dbl>
## To remove rows with NA in any column
%>%
vg_sales dplyr::filter(dplyr::across(dplyr::everything(), ~!is.na(.x))) %>%
print(n = 5)
## Warning: data frame results in `filter()` are deprecated, use `if_any()` or
## `if_all()`.
## # A tibble: 16,327 × 11
## Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 Wii Sports Wii 2006 Sports Nintendo 41.5 29.0 3.77
## 2 2 Super Mario… NES 1985 Platfo… Nintendo 29.1 3.58 6.81
## 3 3 Mario Kart … Wii 2008 Racing Nintendo 15.8 12.9 3.79
## 4 4 Wii Sports … Wii 2009 Sports Nintendo 15.8 11.0 3.28
## 5 5 Pokemon Red… GB 1996 Role-P… Nintendo 11.3 8.89 10.2
## # … with 16,322 more rows, and 2 more variables: Other_Sales <dbl>,
## # Global_Sales <dbl>
Another approach to replace them with either the average for that column or with the closest neighbour value (this works better with time series data).
To replace with the nearest value, we can use the tidyr::fill()
function. That approach wouldn’t be appropriate for our video games example so let’s use the datasets::airquality
dataset instead:
head(datasets::airquality, 5)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## Fill a single column
::airquality %>%
datasets tidyr::fill(Ozone, .direction = "downup") %>%
head(5)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 18 NA 14.3 56 5 5
## Fill all columns
::airquality %>%
datasets tidyr::fill(dplyr::everything(), .direction = "downup") %>%
head(5)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 18 313 14.3 56 5 5
To replace them with the mean, we can either use a package like zoo
, or we can use the tidyverse
packages and our own function:
function(x) {
replace_with_mean <-replace(x, is.na(x), mean(x, na.rm = TRUE))
}
::airquality %>%
datasets # The mutate function creates new columns or overwrites existing ones
dplyr::mutate(dplyr::across(dplyr::everything(), replace_with_mean)) %>%
head(5)
## Ozone Solar.R Wind Temp Month Day
## 1 41.00000 190.0000 7.4 67 5 1
## 2 36.00000 118.0000 8.0 72 5 2
## 3 12.00000 149.0000 12.6 74 5 3
## 4 18.00000 313.0000 11.5 62 5 4
## 5 42.12931 185.9315 14.3 56 5 5
A word of warning here, however. If you’re doing complex modelling or very sensitive analyses, filling values like this can be misleading at best. Always think about the best approach for your specific project and what the repercussions of filling empty values might be.
8.3 Changing column types
When you import data into R, sometimes the type of the column doesn’t match what you want it to be. One way of tackling this is to define your column types when you import the data (as we looked at before), but it’s also perfectly acceptable to change the column type after the import.
Probably the most common conversion is from a character string to a date. For this example, we’re just going to use some test data:
tibble::tribble(~bad_date, ~value,
bad_tibble <-"2012/01/01", 100,
"2014/06/01", 200)
%>%
bad_tibble dplyr::mutate(good_date = as.Date(bad_date, format = "%Y/%m/%d"))
## # A tibble: 2 × 3
## bad_date value good_date
## <chr> <dbl> <date>
## 1 2012/01/01 100 2012-01-01
## 2 2014/06/01 200 2014-06-01
Essentially, all you need to do is wrap your conversion function (e.g. as.Date()
, as.character()
) in a dplyr::mutate()
call and you should be able to change your columns to whatever you need. To replace the original column, make sure you give it the same name when using dplyr::mutate()
:
%>%
bad_tibble # This will replace the bad_date column rather than creating a new column
dplyr::mutate(bad_date = as.Date(bad_date, format = "%Y/%m/%d"))
## # A tibble: 2 × 2
## bad_date value
## <date> <dbl>
## 1 2012-01-01 100
## 2 2014-06-01 200