12 Summarising

As we saw previously, the base video game sales dataset has nearly 17,000 entries! That’s a lot of data, and we’re not really going to able to extract or show any real insights from this data without summarising it to some degree. We’ll lose some of the granularity of the data but it allows us to understand the data a bit better and identify patterns.

Summarisation uses exactly the same grouping concept as mutatation does; by grouping the data we change the scope of each application of the function to be limited to that group.

Let’s go through an example. Let’s say that we want to get the global sales for each publisher in each year. For now, we don’t care about genre or platform. To do this, we’ll need to group by Publisher, Year and Country:

tidy_vg_sales %>%
  dplyr::group_by(Publisher, Year, Country)

And then we’ll want to sum up the Sales column to get our totals:

tidy_vg_sales %>%
  dplyr::group_by(Publisher, Year, Country) %>%
  dplyr::summarise(Total_Sales = sum(Sales), .groups = "drop") %>%
  # The .groups = 'drop' parameter will automatically remove the grouping from the dataset
  print(n = 5)

## # A tibble: 8,995 × 4
##   Publisher        Year Country Total_Sales
##   <chr>           <dbl> <chr>         <dbl>
## 1 10TACLE Studios  2006 EU             0.01
## 2 10TACLE Studios  2006 Global         0.02
## 3 10TACLE Studios  2006 JP             0   
## 4 10TACLE Studios  2006 NA             0.01
## 5 10TACLE Studios  2006 Other          0   
## # … with 8,990 more rows

Now we can see that the number of rows has reduced down to just under 9,000, and we’ve got the total global sales for each of our publishers, countries, and for each year. {dplyr} has looked through each publisher, each country, and each year that publisher has at least 1 game, and added up the total number of sales, then moved onto the next Publisher/Country/Year combination.

In this case we’ve used the sum() function, but any kind of summary function can be used:

smmrsd_vg_sales <- tidy_vg_sales %>%
  dplyr::group_by(Publisher, Year, Country) %>%
  dplyr::summarise(Total_Sales = sum(Sales),
                   Average_Sales = mean(Sales),
                   .groups = "drop")

Because dplyr::mutate() and dplyr::summarise() both use a similar syntax and both function on grouped datasets, using the correct function when starting out can be tough. The best way to remember which one to use is to ask “How many rows am I expecting back from this?”. If the answer is fewer than you’ve got now, you’ll want the summarise() function. Otherwise you’re looking at mutate().