10 Grouping

Before we look at adding new columns and summarising our data, we need to understand the concept of grouping our data.

When a data frame is grouped by one or more variables, any function we apply (when summarising or adding a new column) is applied separately to each combination of values of those variables. Say we have a dataset like this:

print(ungrpd, n = 5)

## # A tibble: 12 × 4
##   Year  Month Group Value
##   <chr> <dbl> <chr> <dbl>
## 1 2012      1 A        10
## 2 2012      1 B        20
## 3 2012      2 A        30
## 4 2012      2 B        15
## 5 2012      3 A        25
## # … with 7 more rows

If we grouped by Year or by Month, any function we applied would be applied separately to each distinct value of that column. So, say we wanted to get the total value for each Year: we could group by the Year column and then just sum the Value column. The same logic applies if we wanted the total for each year and month; we could group by Year and Month, sum the Value column, and we’d get one value for each distinct combination of year and month.

To group a dataset, we use the dplyr::group_by() function:

grpd <- dplyr::group_by(ungrpd, Year)

The dplyr::group_by() function returns the same dataset, but now grouped by the variables you provided. At first, it might not seem as though anything has happened; if we print the dataset, we get almost the same output…

print(grpd, n = 5)

## # A tibble: 12 × 4
## # Groups:   Year [2]
##   Year  Month Group Value
##   <chr> <dbl> <chr> <dbl>
## 1 2012      1 A        10
## 2 2012      1 B        20
## 3 2012      2 A        30
## 4 2012      2 B        15
## 5 2012      3 A        25
## # … with 7 more rows

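Except now the header has a new line starting with Groups:. This tells us that the dataset has been grouped and by which variables; the number in square brackets is the number of groups, so here Year has 2 distinct values. It also means that any subsequent summarisation or mutation we do will be done relative to those groups.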
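To make the effect of grouping concrete, here is a minimal sketch of the totals described earlier. It assumes a small hypothetical tibble, example_df, built from only the first four rows printed above (it is not the full 12-row ungrpd), and it uses dplyr::summarise(), which we have not looked at in detail yet.

```r
# Hypothetical example data: only the first four rows shown above,
# not the full 12-row dataset
example_df <- dplyr::tibble(
  Year  = c("2012", "2012", "2012", "2012"),
  Month = c(1, 1, 2, 2),
  Group = c("A", "B", "A", "B"),
  Value = c(10, 20, 30, 15)
)

# Group by Year, then sum Value: one row per distinct Year
by_year <- dplyr::summarise(
  dplyr::group_by(example_df, Year),
  Total = sum(Value)
)
by_year$Total  # 75

# Group by Year and Month, then sum Value:
# one row per distinct Year/Month combination
by_year_month <- dplyr::summarise(
  dplyr::group_by(example_df, Year, Month),
  Total = sum(Value),
  .groups = "drop"  # return an ungrouped result
)
by_year_month$Total  # 30 45
```

The same sum(Value) call gives a different number of rows depending only on how the data was grouped beforehand.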

To test whether a dataset has been grouped, you can use dplyr::is_grouped_df(); to ungroup a dataset, just use dplyr::ungroup():

dplyr::is_grouped_df(grpd)

## [1] TRUE

dplyr::is_grouped_df(ungrpd)

## [1] FALSE

print(dplyr::ungroup(grpd), n = 5)

## # A tibble: 12 × 4
##   Year  Month Group Value
##   <chr> <dbl> <chr> <dbl>
## 1 2012      1 A        10
## 2 2012      1 B        20
## 3 2012      2 A        30
## 4 2012      2 B        15
## 5 2012      3 A        25
## # … with 7 more rows
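Ungrouping matters because a grouping stays attached to the dataset until you remove it, so later calculations keep respecting it. The sketch below illustrates this with a hypothetical four-row tibble, example_df (based on the first four rows printed above, not the full dataset), and uses dplyr::mutate() to add a column; the names grouped, with_group_total and with_overall_total are illustrative only.

```r
# Hypothetical example data: only the first four rows shown above
example_df <- dplyr::tibble(
  Year  = c("2012", "2012", "2012", "2012"),
  Month = c(1, 1, 2, 2),
  Group = c("A", "B", "A", "B"),
  Value = c(10, 20, 30, 15)
)

grouped <- dplyr::group_by(example_df, Month)

# While grouped, sum(Value) is calculated within each Month
with_group_total <- dplyr::mutate(grouped, Total = sum(Value))
with_group_total$Total  # 30 30 45 45

# After ungrouping, sum(Value) is calculated across every row
with_overall_total <- dplyr::mutate(dplyr::ungroup(grouped), Total = sum(Value))
with_overall_total$Total  # 75 75 75 75

# The grouping survives mutate() unless we explicitly remove it
dplyr::is_grouped_df(with_group_total)    # TRUE
dplyr::is_grouped_df(with_overall_total)  # FALSE
```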