11 Mutating

Sometimes you’ll want to add new columns to your data. Most of the time, these will be calculated columns that can be created based on one or more of the other columns in the dataset. To create new columns, we use the dplyr::mutate() function.

First off, let’s add something simple; the Sales column in our dataset is in millions. Let’s convert that to normal units. All we need to do is provide the name of our new column to the dplyr::mutate() function and how our new column should be calculated:

tidy_vg_sales %>%
  dplyr::mutate(Sales_units = Sales * 1000000) %>%
  dplyr::select(Rank, Name, Platform, Year, Genre, Publisher, Sales_units) %>%
  print(n = 5)
## # A tibble: 62,205 x 7
##    Rank Name       Platform  Year Genre  Publisher Sales_units
##   <dbl> <chr>      <chr>    <dbl> <chr>  <chr>           <dbl>
## 1     1 Wii Sports Wii       2006 Sports Nintendo     41490000
## 2     1 Wii Sports Wii       2006 Sports Nintendo     29020000
## 3     1 Wii Sports Wii       2006 Sports Nintendo      3770000
## 4     1 Wii Sports Wii       2006 Sports Nintendo      8460000
## 5     1 Wii Sports Wii       2006 Sports Nintendo     82740000
## # … with 62,200 more rows

If we wanted to overwrite the Sales column, we do the same but give the new column the same name:

tidy_vg_sales %>%
  dplyr::mutate(Sales = Sales * 1000000) %>%
  dplyr::select(Rank, Name, Platform, Year, Genre, Publisher, Sales) %>%
  print(n = 5)
## # A tibble: 62,205 x 7
##    Rank Name       Platform  Year Genre  Publisher    Sales
##   <dbl> <chr>      <chr>    <dbl> <chr>  <chr>        <dbl>
## 1     1 Wii Sports Wii       2006 Sports Nintendo  41490000
## 2     1 Wii Sports Wii       2006 Sports Nintendo  29020000
## 3     1 Wii Sports Wii       2006 Sports Nintendo   3770000
## 4     1 Wii Sports Wii       2006 Sports Nintendo   8460000
## 5     1 Wii Sports Wii       2006 Sports Nintendo  82740000
## # … with 62,200 more rows

While this is certainly a powerful tool, we’re not really changing the world here. However, we can create more complicated columns by leveraging the concept of applying our function to groups.

When our dataset is grouped, our mutate() call will be evaluated for each permutation of the provided groups. Let’s look at an example of creating a cumulative sum for each Publisher and Country:

tidy_vg_sales %>%
  dplyr::group_by(Publisher, Country) %>%
  dplyr::arrange(Year) %>% # We need to order from earliest date to latest
  dplyr::mutate(Cumulative_Sales = cumsum(Sales)) %>%
  dplyr::filter(Publisher %in% c("Atari", "Activision")) %>% # Let's just look at Atari & Activision
  print(n = 10)
## # A tibble: 5,280 x 9
## # Groups:   Publisher, Country [10]
##     Rank Name      Platform  Year Genre Publisher Country Sales Cumulative_Sales
##    <dbl> <chr>     <chr>    <dbl> <chr> <chr>     <chr>   <dbl>            <dbl>
##  1   259 Asteroids 2600      1980 Shoo… Atari     NA       4                4   
##  2   259 Asteroids 2600      1980 Shoo… Atari     EU       0.26             0.26
##  3   259 Asteroids 2600      1980 Shoo… Atari     JP       0                0   
##  4   259 Asteroids 2600      1980 Shoo… Atari     Other    0.05             0.05
##  5   259 Asteroids 2600      1980 Shoo… Atari     Global   4.31             4.31
##  6   545 Missile … 2600      1980 Shoo… Atari     NA       2.56             6.56
##  7   545 Missile … 2600      1980 Shoo… Atari     EU       0.17             0.43
##  8   545 Missile … 2600      1980 Shoo… Atari     JP       0                0   
##  9   545 Missile … 2600      1980 Shoo… Atari     Other    0.03             0.08
## 10   545 Missile … 2600      1980 Shoo… Atari     Global   2.76             7.07
## # … with 5,270 more rows

Now we’ve got a new column showing the total number of sales for each Publisher up to each year. What {dplyr} has done is essentially filtered the data down a specific publisher and country, created a cumulative sum of the Sales column, and then done that for each publisher and country and stitched it back together.