11 Mutating
Sometimes you’ll want to add new columns to your data. Most of the time, these will be calculated columns that can be created based on one or more of the other columns in the dataset. To create new columns, we use the dplyr::mutate()
function.
First off, let’s add something simple; the Sales column in our dataset is in millions. Let’s convert that to normal units. All we need to do is provide the name of our new column to the dplyr::mutate()
function and how our new column should be calculated:
%>%
tidy_vg_sales dplyr::mutate(Sales_units = Sales * 1000000) %>%
dplyr::select(Rank, Name, Platform, Year, Genre, Publisher, Sales_units) %>%
print(n = 5)
## # A tibble: 62,205 × 7
## Rank Name Platform Year Genre Publisher Sales_units
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1 Wii Sports Wii 2006 Sports Nintendo 41490000
## 2 1 Wii Sports Wii 2006 Sports Nintendo 29020000
## 3 1 Wii Sports Wii 2006 Sports Nintendo 3770000
## 4 1 Wii Sports Wii 2006 Sports Nintendo 8460000
## 5 1 Wii Sports Wii 2006 Sports Nintendo 82740000
## # … with 62,200 more rows
If we wanted to overwrite the Sales column, we do the same but give the new column the same name:
%>%
tidy_vg_sales dplyr::mutate(Sales = Sales * 1000000) %>%
dplyr::select(Rank, Name, Platform, Year, Genre, Publisher, Sales) %>%
print(n = 5)
## # A tibble: 62,205 × 7
## Rank Name Platform Year Genre Publisher Sales
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
## 1 1 Wii Sports Wii 2006 Sports Nintendo 41490000
## 2 1 Wii Sports Wii 2006 Sports Nintendo 29020000
## 3 1 Wii Sports Wii 2006 Sports Nintendo 3770000
## 4 1 Wii Sports Wii 2006 Sports Nintendo 8460000
## 5 1 Wii Sports Wii 2006 Sports Nintendo 82740000
## # … with 62,200 more rows
While this is certainly a powerful tool, we’re not really changing the world here. However, we can create more complicated columns by leveraging the concept of applying our function to groups.
When our dataset is grouped, our mutate()
call will be evaluated for each permutation of the provided groups. Let’s look at an example of creating a cumulative sum for each Publisher and Country:
%>%
tidy_vg_sales dplyr::group_by(Publisher, Country) %>%
dplyr::arrange(Year) %>% # We need to order from earliest date to latest
dplyr::mutate(Cumulative_Sales = cumsum(Sales)) %>%
dplyr::filter(Publisher %in% c("Atari", "Activision")) %>% # Let's just look at Atari & Activision
print(n = 10)
## # A tibble: 5,280 × 9
## # Groups: Publisher, Country [10]
## Rank Name Platform Year Genre Publisher Country Sales Cumulative_Sales
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 259 Asteroids 2600 1980 Shoo… Atari NA 4 4
## 2 259 Asteroids 2600 1980 Shoo… Atari EU 0.26 0.26
## 3 259 Asteroids 2600 1980 Shoo… Atari JP 0 0
## 4 259 Asteroids 2600 1980 Shoo… Atari Other 0.05 0.05
## 5 259 Asteroids 2600 1980 Shoo… Atari Global 4.31 4.31
## 6 545 Missile … 2600 1980 Shoo… Atari NA 2.56 6.56
## 7 545 Missile … 2600 1980 Shoo… Atari EU 0.17 0.43
## 8 545 Missile … 2600 1980 Shoo… Atari JP 0 0
## 9 545 Missile … 2600 1980 Shoo… Atari Other 0.03 0.08
## 10 545 Missile … 2600 1980 Shoo… Atari Global 2.76 7.07
## # … with 5,270 more rows
Now we’ve got a new column showing the total number of sales for each Publisher up to each year. What {dplyr}
has done is essentially filtered the data down a specific publisher and country, created a cumulative sum of the Sales
column, and then done that for each publisher and country and stitched it back together.