A common task scientists need to do is summarizing or aggregating a data set. We’ll start with some basic R functions for this task and then move into some more powerful R functions from the dplyr package to group and summarize your data.
For a basic overall summary of your data set, use the summary function. Here is what that looks like when you use it with the EMP water quality data set:
# Simple overall summary statistics of entire data framesummary(df_wq)
Station Date Chla Pheophytin
Length:62 Min. :2020-01-16 Min. : 0.500 Min. :0.500
Class :character 1st Qu.:2020-10-19 1st Qu.: 1.530 1st Qu.:0.830
Mode :character Median :2021-09-13 Median : 2.515 Median :1.115
Mean :2021-08-12 Mean : 3.042 Mean :1.350
3rd Qu.:2022-05-16 3rd Qu.: 4.188 3rd Qu.:1.472
Max. :2022-12-19 Max. :16.510 Max. :6.130
TotAlkalinity DissAmmonia DissNitrateNitrite DOC
Min. : 46.40 Min. :0.05000 Min. :0.1600 Min. :0.190
1st Qu.: 77.03 1st Qu.:0.05000 1st Qu.:0.3337 1st Qu.:1.600
Median : 84.60 Median :0.06850 Median :0.6830 Median :2.400
Mean : 82.28 Mean :0.09485 Mean :1.2140 Mean :2.751
3rd Qu.: 90.95 3rd Qu.:0.11675 3rd Qu.:1.7150 3rd Qu.:3.700
Max. :116.00 Max. :0.29900 Max. :5.4700 Max. :9.500
TOC DON TotPhos DissOrthophos
Min. :0.200 Min. :0.1000 Min. :0.0820 Min. :0.0650
1st Qu.:1.500 1st Qu.:0.1900 1st Qu.:0.1190 1st Qu.:0.0930
Median :2.350 Median :0.2500 Median :0.1525 Median :0.1100
Mean :2.734 Mean :0.2827 Mean :0.2052 Mean :0.1837
3rd Qu.:3.675 3rd Qu.:0.3500 3rd Qu.:0.2893 3rd Qu.:0.2838
Max. :9.100 Max. :1.0700 Max. :0.4900 Max. :0.4740
NA's :6
TDS TSS TKN Depth
Min. : 152.0 Min. : 1.400 Min. :0.1490 Min. : 5.20
1st Qu.: 307.5 1st Qu.: 4.575 1st Qu.:0.3020 1st Qu.: 6.20
Median : 2169.0 Median : 13.100 Median :0.3950 Median :12.80
Mean : 5819.7 Mean : 26.906 Mean :0.4295 Mean :21.18
3rd Qu.:12125.0 3rd Qu.: 39.500 3rd Qu.:0.5222 3rd Qu.:37.33
Max. :15800.0 Max. :105.000 Max. :1.4400 Max. :42.00
Secchi Microcystis SpCndSurface WTSurface
Min. : 20.00 Min. :1.000 Min. : 278.0 Min. : 9.06
1st Qu.: 40.00 1st Qu.:1.000 1st Qu.: 548.8 1st Qu.:13.51
Median : 68.00 Median :1.000 Median : 3714.0 Median :19.05
Mean : 93.49 Mean :1.548 Mean : 9771.1 Mean :18.18
3rd Qu.:144.00 3rd Qu.:2.000 3rd Qu.:20329.0 3rd Qu.:22.11
Max. :340.00 Max. :4.000 Max. :25278.0 Max. :27.01
NA's :1
You can see that R provided a set of simple summary statistics (min, 25th and 75th quartiles, median, mean, max) for each column in the data frame. If you are interested in summary statistics for a single column in the data frame, you can use the data$column notation to subset your data set. If we wanted summary statistics of just chlorophyll-a, you could use:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.500 1.530 2.515 3.042 4.188 16.510
R provided the same set of simple summary statistics for just the Chla column.
Now, we’ll introduce a more powerful summarizing function from the dplyr package called summarize. We’ll start by using it to calculate some simple summary statistics. First, we’ll calculate the overall average of all chlorophyll-a data in the data set.
# Calculate overall mean of one columndf_wq%>%summarize(mean_Chla =mean(Chla))
# A tibble: 1 × 1
mean_Chla
<dbl>
1 3.04
You may have noticed that, unlike the summary function, summarize provides the summarized data in a tibble or data frame. This is useful if you intend to continue working with the data. Also note that I provided a name for the summarized data column “mean_Chla”, which is a lot like the dplyr::mutate function we learned about earlier.
You can calculate multiple values at once by providing additional arguments to summarize. For example, let’s calculate the overall averages of both chlorophyll-a and Pheophytin.
# Calculate overall mean of multiple columnsdf_wq%>%summarize(mean_Chla =mean(Chla), mean_Pheo =mean(Pheophytin))
R provides a 2-column tibble with our desired summary statistics. When using summarize, be mindful of NA values in your data columns. For example, DON has a few NA values - here is what happens when we try to calculate its mean:
Notice that it returns NA. If you want summarize to ignore or drop the NA values when making calculations, you’ll need to add the na.rm = TRUE argument to your summary function which in this case is mean.
Now, R returns the overall average DON value after ignoring the NA values.
7.1.0.1 Exercise
Now its your turn to try out these summarizing functions we just learned about.
Use summary to provide simple summary statistics for “Secchi” and “WTSurface”. HINT: Run the function on one column at a time.
Now, use summarize to calculate the overall minimum value for “Secchi”. HINT: Watch out for NA values!
Add the minimum value for “WTSurface” to the summarize function used above, assign it to an object, and print it to view the results.
Click below for the answer when you are done!
Code
# Use summary to calculate simple summary statistics for Secchisummary(df_wq$Secchi)# Use summary to calculate simple summary statistics for WTSurfacesummary(df_wq$WTSurface)# Use summarize to calculate the overall minimum value for Secchidf_wq%>%summarize(min_Secchi =min(Secchi, na.rm =TRUE))# Add the minimum value for "WTSurface" and assign it to an object called "df_wq_min"df_wq_min<-df_wq%>%summarize( min_Secchi =min(Secchi, na.rm =TRUE), min_WTSurface =min(WTSurface))# Print df_wq_min to see resultsdf_wq_min
7.2 Grouping
Calculating overall summary statistics is useful, but the real power of summarize becomes more apparent when its used in combination with the group_by function (also within the dplyr package). Using group_by with summarize allows for calculating summary statistics for groups of data within your data set. Examples include averages for each station, seasonal and annual statistics, or other combinations that you can imagine. Here is a simple example using these two functions to calculate overall average chlorophyll-a values for each station in the EMP water quality data set (D7 and P8).
# Group by Stationdf_wq%>%group_by(Station)%>%summarize(mean_Chla =mean(Chla))
# A tibble: 2 × 2
Station mean_Chla
<chr> <dbl>
1 D7 2.71
2 P8 3.37
You’ll see that now we have an additional column added to the tibble for “Station”, and the “mean_Chla” column contains the average values for each station. As with the example in the Basic Summarizing section above, you can calculate multiple values at once for each group by providing additional arguments to summarize.
# Calculate more than one summary statistic within `summarize`df_wq%>%group_by(Station)%>%summarize( min_Chla =min(Chla), mean_Chla =mean(Chla), median_Chla =median(Chla), max_Chla =max(Chla), sd_Chla =sd(Chla))
You can also group by more than one variable at a time. For example, the EMP water quality data set contains data from multiple years (2020-2022). We can calculate the same series of summary statistics for chlorophyll-a in the prior example for each station and year combination.
# First, we add a second grouping variable for year, creating a new object for# the resulting data framedf_wq_c<-df_wq%>%mutate(Year =year(Date))# Next, we calculate summary statistics for Chla for each station and year# combinationdf_wq_c%>%group_by(Station, Year)%>%summarize( min_Chla =min(Chla), mean_Chla =mean(Chla), median_Chla =median(Chla), max_Chla =max(Chla), sd_Chla =sd(Chla))
Wow, now we are starting to get somewhere with summarizing our data set. You’ll see that now we have another column added to the tibble for “Year” in addition to “Station” with the desired summary statistics for each combination in the following columns. You may also have noticed that the printout of the tibble indicates that it is still grouped by “Station”. This is because the default behavior of using summarize after group_by is to drop the last level of grouping (“Year”) in its output. It is always good practice to ungroup a data frame when you no longer need it to be grouped because you can get unintended results when using other functions on it. You can ungroup the data frame by using the ungroup function within the dplyr package.
# Always best practice to ungroup data after finished with operationdf_wq_c%>%group_by(Station, Year)%>%summarize( min_Chla =min(Chla), mean_Chla =mean(Chla), median_Chla =median(Chla), max_Chla =max(Chla), sd_Chla =sd(Chla))%>%ungroup()
Now, you see that the output data frame is no longer grouped. A useful trick is to use the .by argument within summarize to temporarily group the data frame just for the summarize operation.
# It's possible to group data within `summarize`df_wq_c%>%summarize( min_Chla =min(Chla), mean_Chla =mean(Chla), median_Chla =median(Chla), max_Chla =max(Chla), sd_Chla =sd(Chla), .by =c(Station, Year))
We won’t cover it here, but the group_by function also works with other functions in the dplyr package including mutate, filter, and arrange. There are also many other useful things you can do with the summarize function that we won’t cover in this class including using it with the across function. The across function allows for the summarize and mutate functions to apply operations across multiple columns in a data frame. Using it in combination with tidyselect functions allows for much more efficient code.
Summarize vs. Mutate
You may be wondering how summarize and mutate are different since they both do similar things. The main difference is that mutate always returns the same number of rows in the data frame, and summarize returns just one row for the specified summary function(s). summarize with group_by returns a row for each combination of grouping variables.
We note the structure here: we have metadata for Station and Date, and then each of the 18 analytes has its own columns. Therefore, our dimensions are 62 rows x 20 columns. Sometimes, however, it’s more advantageous to structure the data in a different way. For example, maybe we want all analyte names in one column with their values in another.
We can achieve this by pivoting the rows and columns. tidyverse has two handy functions for this: pivot_longer and pivot_wider.
7.3.1pivot_longer
When we gather multiple columns into two columns (one for names, another for values), we make the dataset longer; hence, pivot_longer. For this function, you need to specify:
which columns to include (cols)
what to call the name column (names_to)
what to call the value column (values_to)
Lets use this to pivot all the analytes into two columns. I could achieve this by writing out all the column names. However, a handy shortcut is to use the - operator; similar to negate, this tells the code to consider all columns but the ones included:
We see that we went from 20 columns to 4. However, we also went from 62 rows to 1116. Why 1116? The columns we did not specify (Station and Date) were used to help create a key; every row is a unique combination of Station, Date, and our new name column Analyte. There are 62 unique Date/Station combos, and each of the 18 analytes has a row for each one, leading to 1116 rows.
Lets pretend this was the original format we received the data in, and we rather have each analyte be its own column. In other words, our goal is to transform two columns into >2 columns, making the dataset wider. The best function for this, as you might suspect, is pivot_wider. Here, instead of pivoting to columns, we pivot from them, and need:
which column to take the names from (names_from)
which column to take the values from (values_from)
Note: The pivot functions are very powerful and have a large amount of flexibility. For example, you can specify multiple value columns or combine columns to create a single names column. The help documentation has useful examples to reference for cases like these.
7.3.3 Exercise
Now its your turn to try out grouping and summarizing.
Use group_by and summarize to calculate the minimum, median, and maximum values for “Secchi” for each station. HINT: Watch out for NA values!
Add “Year” as a grouping variable to the operation above to calculate summary statistics for “Secchi” for each station and year combination. Assign this as it’s own object named df_summary. HINT: Don’t forget to ungroup your output data frame!
Pivot df_summary so that the secchi statistics are rows and the years are columns. HINT: You will use both pivot longer and pivot wider
Click below for the answer when you are done!
Code
# Calculate the minimum, median, and maximum values for Secchi for each stationdf_wq%>%group_by(Station)%>%summarize( min_Secchi =min(Secchi, na.rm =TRUE), median_Secchi =median(Secchi, na.rm =TRUE), max_Secchi =max(Secchi, na.rm =TRUE))# Add "Year" as a grouping variable to the operation abovedf_summary<-df_wq%>%mutate(Year =year(Date))%>%group_by(Station, Year)%>%summarize( min_Secchi =min(Secchi, na.rm =TRUE), median_Secchi =median(Secchi, na.rm =TRUE), max_Secchi =max(Secchi, na.rm =TRUE))%>%ungroup()# Pivot the data; first make the secchi columns into rows, then the year rows into columnsdf_summary_long<-df_summary%>%pivot_longer(cols =c('min_Secchi','median_Secchi','max_Secchi'), names_to ='SecchiStat', values_to ='Value')df_summary_wide<-df_summary_long%>%pivot_wider(names_from =Year, values_from =Value)# Can also complete in one stepdf_summary_wide<-df_summary%>%pivot_longer(cols =starts_with('min_'):starts_with('max_'), names_to ='SecchiStat', values_to ='Value')%>%pivot_wider(names_from =Year, values_from =Value)