8 Summarizing Data
8.1 Introduction
A common task that many of us need to do is to summarize or aggregate a data set to gain a simplified overview of the data or to more easily work with the data. Some of you may be familiar with pivot tables in Excel which can help with this task. The R programming language is also very helpful with summarizing data and allows for greater customization than pivot tables. We’ll start this lesson with some basic R functions to summarize your data, and then move into some more powerful R functions from the dplyr
package to group and summarize your data.
8.2 Import Data
We’ll start the day with loading packages and importing the EMP water quality data.
As a reminder, here is the structure of the EMP water quality data.
glimpse(df_wq)
Rows: 62
Columns: 20
$ Station <chr> "P8", "D7", "P8", "D7", "P8", "D7", "P8", "D7", "P8…
$ Date <date> 2020-01-16, 2020-01-22, 2020-02-14, 2020-02-20, 20…
$ Chla <dbl> 0.64, 0.67, 1.46, 2.15, 1.40, 1.89, 4.73, 1.74, 6.4…
$ Pheophytin <dbl> 0.50, 0.87, 0.69, 0.50, 0.56, 1.13, 1.25, 0.89, 0.8…
$ TotAlkalinity <dbl> 98.0, 82.0, 81.0, 86.0, 80.0, 93.0, 59.0, 78.0, 63.…
$ DissAmmonia <dbl> 0.150, 0.210, 0.250, 0.140, 0.110, 0.220, 0.050, 0.…
$ DissNitrateNitrite <dbl> 2.800, 0.490, 1.700, 0.480, 1.600, 0.380, 1.070, 0.…
$ DOC <dbl> 3.90, 0.27, 2.80, 0.39, 2.00, 0.19, 2.80, 1.20, 3.1…
$ TOC <dbl> 4.10, 0.32, 2.50, 0.41, 2.10, 0.20, 2.80, 1.20, 3.1…
$ DON <dbl> NA, NA, NA, NA, NA, NA, 0.30, 0.20, 0.30, 0.10, 0.5…
$ TotPhos <dbl> 0.310, 0.082, 0.130, 0.130, 0.190, 0.100, 0.188, 0.…
$ DissOrthophos <dbl> 0.200, 0.071, 0.130, 0.065, 0.140, 0.082, 0.177, 0.…
$ TDS <dbl> 380, 9500, 340, 5800, 290, 8700, 280, 7760, 227, 11…
$ TSS <dbl> 8.9, 38.0, 2.2, 18.0, 1.4, 28.0, 6.6, 35.6, 5.3, 23…
$ TKN <dbl> 0.520, 0.480, 0.430, 0.250, 0.400, 0.200, 0.400, 0.…
$ Depth <dbl> 28.9, 18.8, 39.0, 7.1, 39.0, 7.2, 37.1, 5.2, 36.7, …
$ Secchi <dbl> 116, 30, 212, 52, 340, 48, 100, 40, 160, 44, 120, 6…
$ Microcystis <dbl> 1, 1, 1, 1, 1, 1, 3, 2, 3, 2, 4, 2, 3, 2, 2, 1, 1, …
$ SpCndSurface <dbl> 667, 15532, 647, 11369, 530, 16257, 503, 12946, 404…
$ WTSurface <dbl> 9.67, 9.97, 11.09, 12.51, 13.97, 13.81, 23.46, 21.1…
8.3 Basic Summarizing
For a basic overall summary of your data set, use the summary
function. Here is what that looks like when you use it with the EMP water quality data set.
# Simple overall summary statistics of entire data frame
summary(df_wq)
Station Date Chla Pheophytin
Length:62 Min. :2020-01-16 Min. : 0.500 Min. :0.500
Class :character 1st Qu.:2020-10-19 1st Qu.: 1.530 1st Qu.:0.830
Mode :character Median :2021-09-13 Median : 2.515 Median :1.115
Mean :2021-08-12 Mean : 3.042 Mean :1.350
3rd Qu.:2022-05-16 3rd Qu.: 4.188 3rd Qu.:1.472
Max. :2022-12-19 Max. :16.510 Max. :6.130
TotAlkalinity DissAmmonia DissNitrateNitrite DOC
Min. : 46.40 Min. :0.05000 Min. :0.1600 Min. :0.190
1st Qu.: 77.03 1st Qu.:0.05000 1st Qu.:0.3337 1st Qu.:1.600
Median : 84.60 Median :0.06850 Median :0.6830 Median :2.400
Mean : 82.28 Mean :0.09485 Mean :1.2140 Mean :2.751
3rd Qu.: 90.95 3rd Qu.:0.11675 3rd Qu.:1.7150 3rd Qu.:3.700
Max. :116.00 Max. :0.29900 Max. :5.4700 Max. :9.500
TOC DON TotPhos DissOrthophos
Min. :0.200 Min. :0.1000 Min. :0.0820 Min. :0.0650
1st Qu.:1.500 1st Qu.:0.1900 1st Qu.:0.1190 1st Qu.:0.0930
Median :2.350 Median :0.2500 Median :0.1525 Median :0.1100
Mean :2.734 Mean :0.2827 Mean :0.2052 Mean :0.1837
3rd Qu.:3.675 3rd Qu.:0.3500 3rd Qu.:0.2893 3rd Qu.:0.2838
Max. :9.100 Max. :1.0700 Max. :0.4900 Max. :0.4740
NA's :6
TDS TSS TKN Depth
Min. : 152.0 Min. : 1.400 Min. :0.1490 Min. : 5.20
1st Qu.: 307.5 1st Qu.: 4.575 1st Qu.:0.3020 1st Qu.: 6.20
Median : 2169.0 Median : 13.100 Median :0.3950 Median :12.80
Mean : 5819.7 Mean : 26.906 Mean :0.4295 Mean :21.18
3rd Qu.:12125.0 3rd Qu.: 39.500 3rd Qu.:0.5222 3rd Qu.:37.33
Max. :15800.0 Max. :105.000 Max. :1.4400 Max. :42.00
Secchi Microcystis SpCndSurface WTSurface
Min. : 20.00 Min. :1.000 Min. : 278.0 Min. : 9.06
1st Qu.: 40.00 1st Qu.:1.000 1st Qu.: 548.8 1st Qu.:13.51
Median : 68.00 Median :1.000 Median : 3714.0 Median :19.05
Mean : 93.49 Mean :1.548 Mean : 9771.1 Mean :18.18
3rd Qu.:144.00 3rd Qu.:2.000 3rd Qu.:20329.0 3rd Qu.:22.11
Max. :340.00 Max. :4.000 Max. :25278.0 Max. :27.01
NA's :1
You can see that R provided a set of simple summary statistics (min, 25th and 75th quartiles, median, mean, max) for each column in the data frame. If you are interested in summary statistics for a single column in the data frame, you can use the data$column
notation to subset your data set. If we wanted summary statistics of just chlorophyll-a, you could use:
# Summary of one column
summary(df_wq$Chla)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.500 1.530 2.515 3.042 4.188 16.510
R provided the same set of simple summary statistics for just the Chla
column.
Now, we’ll introduce a more powerful summarizing function from the dplyr
package called summarize
. We’ll start by using it to calculate some simple summary statistics. First, we’ll calculate the overall average of all chlorophyll-a data in the data set.
# A tibble: 1 × 1
mean_Chla
<dbl>
1 3.04
You may have noticed that unlike the summary
function, summarize
provides the summarized data in a tibble or data frame. This is useful if you intend to continue working with the data. Also note that I provided a name for the summarized data column “mean_Chla”, which is a lot like the dplyr::mutate
function we learned about earlier.
You can calculate multiple values at once by providing additional arguments to summarize
. For example, let’s calculate the overall averages of both chlorophyll-a and Pheophytin.
# Calculate overall mean of multiple columns
df_wq %>% summarize(mean_Chla = mean(Chla), mean_Pheo = mean(Pheophytin))
# A tibble: 1 × 2
mean_Chla mean_Pheo
<dbl> <dbl>
1 3.04 1.35
R provides a 2-column tibble with our desired summary statistics. When using summarize
, be mindful of NA
values in your data columns. For example, DON has a few NA
values - here is what happens when we try to calculate its mean:
Notice that it returns NA
. If you want summarize
to ignore or drop the NA
values when making calculations, you’ll need to add the na.rm = TRUE
argument to your summary function which in this case is mean
.
Now, R returns the overall average DON value after ignoring the NA
values.
8.3.1 Exercise
Now its your turn to try out these summarizing functions we just learned about.
Use
summary
to provide simple summary statistics for “Secchi” and “WTSurface”.
HINT: Run the function on one column at a time.Now, use
summarize
to calculate the overall minimum value for “Secchi”.
HINT: Watch out forNA
values!Add the minimum value for “WTSurface” to the
summarize
function used above, assign it to an object, and print it to view the results.
Click below for the answer when you are done!
Code
# Use summary to calculate simple summary statistics for Secchi
summary(df_wq$Secchi)
# Use summary to calculate simple summary statistics for WTSurface
summary(df_wq$WTSurface)
# Use summarize to calculate the overall minimum value for Secchi
df_wq %>% summarize(min_Secchi = min(Secchi, na.rm = TRUE))
# Add the minimum value for "WTSurface" and assign it to an object called "df_wq_min"
df_wq_min <- df_wq %>%
summarize(
min_Secchi = min(Secchi, na.rm = TRUE),
min_WTSurface = min(WTSurface)
)
# Print df_wq_min to see results
df_wq_min
8.4 Grouping
Calculating overall summary statistics is useful, but the real power of summarize
becomes more apparent when its used in combination with the group_by
function (also within the dplyr
package). Using group_by
with summarize
allows for calculating summary statistics for groups of data within your data set. Examples include averages for each station, seasonal and annual statistics, or other combinations that you can imagine. Here is a simple example using these two functions to calculate overall average chlorophyll-a values for each station in the EMP water quality data set (D7 and P8).
# A tibble: 2 × 2
Station mean_Chla
<chr> <dbl>
1 D7 2.71
2 P8 3.37
You’ll see that now we have an additional column added to the tibble for “Station”, and the “mean_Chla” column contains the average values for each station. As with the example in the Basic Summarizing section above, you can calculate multiple values at once for each group by providing additional arguments to summarize
.
# Calculate more than one summary statistic within `summarize`
df_wq %>%
group_by(Station) %>%
summarize(
min_Chla = min(Chla),
mean_Chla = mean(Chla),
median_Chla = median(Chla),
max_Chla = max(Chla),
sd_Chla = sd(Chla)
)
# A tibble: 2 × 6
Station min_Chla mean_Chla median_Chla max_Chla sd_Chla
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 D7 0.5 2.71 2.21 7.36 1.81
2 P8 0.57 3.37 2.67 16.5 2.92
You can also group by more than one variable at a time. For example, the EMP water quality data set contains data from multiple years (2020-2022). We can calculate the same series of summary statistics for chlorophyll-a in the prior example for each station and year combination.
# First, we add a second grouping variable for year, creating a new object for
# the resulting data frame
df_wq_c <- df_wq %>% mutate(Year = year(Date))
# Next, we calculate summary statistics for Chla for each station and year
# combination
df_wq_c %>%
group_by(Station, Year) %>%
summarize(
min_Chla = min(Chla),
mean_Chla = mean(Chla),
median_Chla = median(Chla),
max_Chla = max(Chla),
sd_Chla = sd(Chla)
)
# A tibble: 6 × 7
# Groups: Station [2]
Station Year min_Chla mean_Chla median_Chla max_Chla sd_Chla
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 D7 2020 0.5 1.53 1.74 2.79 0.749
2 D7 2021 1.17 3.58 3.12 6.76 1.97
3 D7 2022 0.67 2.88 2.51 7.36 1.87
4 P8 2020 0.64 4.48 2.81 16.5 4.88
5 P8 2021 1.25 3.18 3.24 5.56 1.57
6 P8 2022 0.57 2.70 2.58 5.24 1.54
Wow, now we are starting to get somewhere with summarizing our data set. You’ll see that now we have another column added to the tibble for “Year” in addition to “Station” with the desired summary statistics for each combination in the following columns. You may also have noticed that the printout of the tibble indicates that it is still grouped by “Station”. This is because the default behavior of using summarize
after group_by
is to drop the last level of grouping (“Year”) in its output. It is always good practice to ungroup a data frame when you no longer need it to be grouped because you can get unintended results when using other functions on it. You can ungroup the data frame by using the ungroup
function within the dplyr
package.
# Always best practice to ungroup data after finished with operation
df_wq_c %>%
group_by(Station, Year) %>%
summarize(
min_Chla = min(Chla),
mean_Chla = mean(Chla),
median_Chla = median(Chla),
max_Chla = max(Chla),
sd_Chla = sd(Chla)
) %>%
ungroup()
# A tibble: 6 × 7
Station Year min_Chla mean_Chla median_Chla max_Chla sd_Chla
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 D7 2020 0.5 1.53 1.74 2.79 0.749
2 D7 2021 1.17 3.58 3.12 6.76 1.97
3 D7 2022 0.67 2.88 2.51 7.36 1.87
4 P8 2020 0.64 4.48 2.81 16.5 4.88
5 P8 2021 1.25 3.18 3.24 5.56 1.57
6 P8 2022 0.57 2.70 2.58 5.24 1.54
Now, you see that the output data frame is no longer grouped. A useful trick is to use the .by
argument within summarize
to temporarily group the data frame just for the summarize operation.
# It's possible to group data within `summarize`
df_wq_c %>%
summarize(
min_Chla = min(Chla),
mean_Chla = mean(Chla),
median_Chla = median(Chla),
max_Chla = max(Chla),
sd_Chla = sd(Chla),
.by = c(Station, Year)
)
# A tibble: 6 × 7
Station Year min_Chla mean_Chla median_Chla max_Chla sd_Chla
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 P8 2020 0.64 4.48 2.81 16.5 4.88
2 D7 2020 0.5 1.53 1.74 2.79 0.749
3 P8 2021 1.25 3.18 3.24 5.56 1.57
4 D7 2021 1.17 3.58 3.12 6.76 1.97
5 D7 2022 0.67 2.88 2.51 7.36 1.87
6 P8 2022 0.57 2.70 2.58 5.24 1.54
We won’t cover it here, but the group_by
function also works with other functions in the dplyr
package including mutate
, filter
, and arrange
. There are also many other useful things you can do with the summarize
function that we won’t cover in this class including using it with the across
function. The across
function allows for the summarize
and mutate
functions to apply operations across multiple columns in a data frame. Using it in combination with tidyselect functions allows for much more efficient code.
You may be wondering how summarize
and mutate
are different since they both do similar things. The main difference is that mutate
always returns the same number of rows in the data frame, and summarize
returns just one row for the specified summary function(s). summarize
with group_by
returns a row for each combination of grouping variables.
8.4.1 Exercise
Now its your turn to try out grouping and summarizing.
Use
group_by
andsummarize
to calculate the minimum, median, and maximum values for “Secchi” for each station.
HINT: Watch out forNA
values!Add “Year” as a grouping variable to the operation above to calculate summary statistics for “Secchi” for each station and year combination.
HINT: Don’t forget to ungroup your output data frame!Try the same as in #2 above, but use the
.by
argument within thesummarize
function.
Click below for the answer when you are done!
Code
# Calculate the minimum, median, and maximum values for Secchi for each station
df_wq %>%
group_by(Station) %>%
summarize(
min_Secchi = min(Secchi, na.rm = TRUE),
median_Secchi = median(Secchi, na.rm = TRUE),
max_Secchi = max(Secchi, na.rm = TRUE)
)
# Add "Year" as a grouping variable to the operation above
df_wq %>%
mutate(Year = year(Date)) %>%
group_by(Station, Year) %>%
summarize(
min_Secchi = min(Secchi, na.rm = TRUE),
median_Secchi = median(Secchi, na.rm = TRUE),
max_Secchi = max(Secchi, na.rm = TRUE)
) %>%
ungroup()
# Try the same as in the operation above, but use the `.by` argument within the `summarize` function.
df_wq %>%
mutate(Year = year(Date)) %>%
summarize(
min_Secchi = min(Secchi, na.rm = TRUE),
median_Secchi = median(Secchi, na.rm = TRUE),
max_Secchi = max(Secchi, na.rm = TRUE),
.by = c(Station, Year)
)