10 Grouping
Before we look at adding new columns and summarising our data, we need to understand the concept of grouping our data.
When a data frame has one or more groups, any function that’s applied (when summarising or adding a new column) is applied for each combination of those groups. So say we have a dataset like this:
print(ungrpd, n = 5)
## # A tibble: 12 × 4
## Year Month Group Value
## <chr> <dbl> <chr> <dbl>
## 1 2012 1 A 10
## 2 2012 1 B 20
## 3 2012 2 A 30
## 4 2012 2 B 15
## 5 2012 3 A 25
## # … with 7 more rows
If we grouped by Year or by Month, any function we applied would be applied separately to each value of that grouping column. Say we wanted to get the total value for each year: we could group by the Year column and then sum the Value column. The same logic applies if we wanted the total for each year and month: we could group by the Year and Month columns and sum the Value column, and we’d then get one total for each distinct combination of year and month.
To group a dataset, we use the dplyr::group_by() function:
grpd <- dplyr::group_by(ungrpd, Year)
The dplyr::group_by() function returns the same dataset, but now grouped by the variables you provided. At first, it might not seem as though anything happened: if we print the dataset, we still get basically the same output…
print(grpd, n = 5)
## # A tibble: 12 × 4
## # Groups: Year [2]
## Year Month Group Value
## <chr> <dbl> <chr> <dbl>
## 1 2012 1 A 10
## 2 2012 1 B 20
## 3 2012 2 A 30
## 4 2012 2 B 15
## 5 2012 3 A 25
## # … with 7 more rows
Except now the header has a new entry, Groups:, which tells us that the dataset has been grouped and by which variables (here Year, which has 2 groups). This means that any subsequent summarisation or mutation we do will be done relative to those groups.
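As a rough preview of what "relative to those groups" means in practice, the sketch below computes the year-and-month totals described earlier. It is only an illustration: example_df is a hypothetical tibble holding just the five rows visible in the printout (not the full 12-row ungrpd dataset), and dplyr::summarise() with its .groups argument assumes dplyr 1.0.0 or later.

```r
# Hypothetical stand-in for `ungrpd`: only the five rows shown in the printout
example_df <- tibble::tibble(
  Year  = c("2012", "2012", "2012", "2012", "2012"),
  Month = c(1, 1, 2, 2, 3),
  Group = c("A", "B", "A", "B", "A"),
  Value = c(10, 20, 30, 15, 25)
)

# Group by Year and Month, then sum Value within each combination.
# .groups = "drop" returns an ungrouped result.
monthly_totals <- dplyr::summarise(
  dplyr::group_by(example_df, Year, Month),
  Total = sum(Value),
  .groups = "drop"
)

print(monthly_totals)
```

With these five rows, the result has one row per distinct year and month rather than one row per observation.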
To test whether a dataset has been grouped, you can use the dplyr::is_grouped_df() function, and to ungroup a dataset, just use dplyr::ungroup():
dplyr::is_grouped_df(grpd)
## [1] TRUE
dplyr::is_grouped_df(ungrpd)
## [1] FALSE
print(dplyr::ungroup(grpd), n = 5)
## # A tibble: 12 × 4
## Year Month Group Value
## <chr> <dbl> <chr> <dbl>
## 1 2012 1 A 10
## 2 2012 1 B 20
## 3 2012 2 A 30
## 4 2012 2 B 15
## 5 2012 3 A 25
## # … with 7 more rows
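To see why ungrouping matters, the following sketch applies the same calculation to a grouped and an ungrouped copy of the data. It again uses a hypothetical five-row tibble (share_df) built from the rows visible above, and the Share column is an illustrative name, not something from the original dataset.

```r
# Hypothetical stand-in for `ungrpd`: only the five rows shown in the printout
share_df <- tibble::tibble(
  Year  = c("2012", "2012", "2012", "2012", "2012"),
  Month = c(1, 1, 2, 2, 3),
  Group = c("A", "B", "A", "B", "A"),
  Value = c(10, 20, 30, 15, 25)
)

by_month <- dplyr::group_by(share_df, Month)

# Grouped: sum(Value) is computed within each Month,
# so Share is each row's share of its own month's total
grouped_share <- dplyr::mutate(by_month, Share = Value / sum(Value))

# Ungrouped: sum(Value) is computed over the whole dataset,
# so Share is each row's share of the overall total
overall_share <- dplyr::mutate(dplyr::ungroup(by_month), Share = Value / sum(Value))

print(grouped_share)
print(overall_share)
```

The code is identical in both calls; only the grouping differs, which is why it is worth checking with dplyr::is_grouped_df() before applying further steps to a dataset whose grouping you are unsure of.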