7 Summarizing Data

Author

S Perry

7.1 Basic Summarizing

A common task scientists need to do is summarizing or aggregating a data set. We’ll start with some basic R functions for this task and then move into some more powerful R functions from the dplyr package to group and summarize your data.

For a basic overall summary of your data set, use the summary function. Here is what that looks like when you use it with the EMP water quality data set:

# Simple overall summary statistics of entire data frame
summary(df_wq)

   Station               Date                 Chla          Pheophytin   
 Length:62          Min.   :2020-01-16   Min.   : 0.500   Min.   :0.500  
 Class :character   1st Qu.:2020-10-19   1st Qu.: 1.530   1st Qu.:0.830  
 Mode  :character   Median :2021-09-13   Median : 2.515   Median :1.115  
                    Mean   :2021-08-12   Mean   : 3.042   Mean   :1.350  
                    3rd Qu.:2022-05-16   3rd Qu.: 4.188   3rd Qu.:1.472  
                    Max.   :2022-12-19   Max.   :16.510   Max.   :6.130  
                                                                         
 TotAlkalinity     DissAmmonia      DissNitrateNitrite      DOC       
 Min.   : 46.40   Min.   :0.05000   Min.   :0.1600     Min.   :0.190  
 1st Qu.: 77.03   1st Qu.:0.05000   1st Qu.:0.3337     1st Qu.:1.600  
 Median : 84.60   Median :0.06850   Median :0.6830     Median :2.400  
 Mean   : 82.28   Mean   :0.09485   Mean   :1.2140     Mean   :2.751  
 3rd Qu.: 90.95   3rd Qu.:0.11675   3rd Qu.:1.7150     3rd Qu.:3.700  
 Max.   :116.00   Max.   :0.29900   Max.   :5.4700     Max.   :9.500  
                                                                      
      TOC             DON            TotPhos       DissOrthophos   
 Min.   :0.200   Min.   :0.1000   Min.   :0.0820   Min.   :0.0650  
 1st Qu.:1.500   1st Qu.:0.1900   1st Qu.:0.1190   1st Qu.:0.0930  
 Median :2.350   Median :0.2500   Median :0.1525   Median :0.1100  
 Mean   :2.734   Mean   :0.2827   Mean   :0.2052   Mean   :0.1837  
 3rd Qu.:3.675   3rd Qu.:0.3500   3rd Qu.:0.2893   3rd Qu.:0.2838  
 Max.   :9.100   Max.   :1.0700   Max.   :0.4900   Max.   :0.4740  
                 NA's   :6                                         
      TDS               TSS               TKN             Depth      
 Min.   :  152.0   Min.   :  1.400   Min.   :0.1490   Min.   : 5.20  
 1st Qu.:  307.5   1st Qu.:  4.575   1st Qu.:0.3020   1st Qu.: 6.20  
 Median : 2169.0   Median : 13.100   Median :0.3950   Median :12.80  
 Mean   : 5819.7   Mean   : 26.906   Mean   :0.4295   Mean   :21.18  
 3rd Qu.:12125.0   3rd Qu.: 39.500   3rd Qu.:0.5222   3rd Qu.:37.33  
 Max.   :15800.0   Max.   :105.000   Max.   :1.4400   Max.   :42.00  
                                                                     
     Secchi        Microcystis     SpCndSurface       WTSurface    
 Min.   : 20.00   Min.   :1.000   Min.   :  278.0   Min.   : 9.06  
 1st Qu.: 40.00   1st Qu.:1.000   1st Qu.:  548.8   1st Qu.:13.51  
 Median : 68.00   Median :1.000   Median : 3714.0   Median :19.05  
 Mean   : 93.49   Mean   :1.548   Mean   : 9771.1   Mean   :18.18  
 3rd Qu.:144.00   3rd Qu.:2.000   3rd Qu.:20329.0   3rd Qu.:22.11  
 Max.   :340.00   Max.   :4.000   Max.   :25278.0   Max.   :27.01  
 NA's   :1

You can see that R provided a set of simple summary statistics (minimum, 1st quartile, median, mean, 3rd quartile, and maximum) for each numeric or date column in the data frame, along with a count of NA values where any are present. For character columns such as Station, it reports only the length, class, and mode. If you are interested in summary statistics for a single column in the data frame, you can use the data$column notation to subset your data set. For summary statistics of just chlorophyll-a, you could use:

# Summary of one column
summary(df_wq$Chla)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.500   1.530   2.515   3.042   4.188  16.510

R provided the same set of simple summary statistics for just the Chla column.

Now, we’ll introduce a more powerful summarizing function from the dplyr package called summarize. We’ll start by using it to calculate some simple summary statistics. First, we’ll calculate the overall average of all chlorophyll-a data in the data set.

# Calculate overall mean of one column
df_wq %>%
  summarize(mean_Chla = mean(Chla))

# A tibble: 1 × 1
  mean_Chla
      <dbl>
1      3.04

You may have noticed that, unlike the summary function, summarize returns the summarized data as a tibble (a type of data frame). This is useful if you intend to continue working with the data. Also note that I provided a name for the summarized column, “mean_Chla”, using the same name = expression syntax as the dplyr::mutate function we learned about earlier.

You can calculate multiple values at once by providing additional arguments to summarize. For example, let’s calculate the overall averages of both chlorophyll-a and Pheophytin.

# Calculate overall mean of multiple columns
df_wq %>% summarize(mean_Chla = mean(Chla), mean_Pheo = mean(Pheophytin))

# A tibble: 1 × 2
  mean_Chla mean_Pheo
      <dbl>     <dbl>
1      3.04      1.35

R provides a 2-column tibble with our desired summary statistics. When using summarize, be mindful of NA values in your data columns. For example, DON has six NA values, as the summary output above shows. Here is what happens when we try to calculate its mean:

df_wq %>% summarize(mean_DON = mean(DON))

# A tibble: 1 × 1
  mean_DON
     <dbl>
1       NA

Notice that it returns NA. If you want summarize to ignore the NA values when making calculations, you’ll need to add the na.rm = TRUE argument to your summary function, which in this case is mean.

df_wq %>% summarize(mean_DON = mean(DON, na.rm = TRUE))

# A tibble: 1 × 1
  mean_DON
     <dbl>
1    0.283

Now, R returns the overall average DON value after ignoring the NA values.
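Before dropping NA values, it can help to know how many there are. The sketch below is a minimal illustration on a made-up data frame (df_toy and its values are hypothetical, not part of the EMP data set); it counts the missing values alongside the mean so you can see how much data the average is actually based on.

```r
library(dplyr)

# Hypothetical toy data standing in for a column with missing values
df_toy <- tibble(
  Station = c("A", "A", "B", "B"),
  DON = c(0.2, NA, 0.4, 0.3)
)

na_check <- df_toy %>%
  summarize(
    n_total = n(),                      # number of rows
    n_missing = sum(is.na(DON)),        # number of NA values in DON
    mean_DON = mean(DON, na.rm = TRUE)  # mean of the non-missing values only
  )

na_check
```

Here the mean is calculated from three of the four rows, which the n_missing column makes explicit.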

7.1.0.1 Exercise

Now it’s your turn to try out the summarizing functions we just learned about.

Use summary to provide simple summary statistics for “Secchi” and “WTSurface”.
HINT: Run the function on one column at a time.
Now, use summarize to calculate the overall minimum value for “Secchi”.
HINT: Watch out for NA values!
Add the minimum value for “WTSurface” to the summarize function used above, assign it to an object, and print it to view the results.

Check your work against the answer below when you are done!

# Use summary to calculate simple summary statistics for Secchi
summary(df_wq$Secchi)

# Use summary to calculate simple summary statistics for WTSurface
summary(df_wq$WTSurface)

# Use summarize to calculate the overall minimum value for Secchi
df_wq %>% summarize(min_Secchi = min(Secchi, na.rm = TRUE))

# Add the minimum value for "WTSurface" and assign it to an object called "df_wq_min"
df_wq_min <- df_wq %>% 
  summarize(
    min_Secchi = min(Secchi, na.rm = TRUE),
    min_WTSurface = min(WTSurface)
  )

# Print df_wq_min to see results
df_wq_min

7.2 Grouping

Calculating overall summary statistics is useful, but the real power of summarize becomes more apparent when it’s used in combination with the group_by function (also in the dplyr package). Using group_by with summarize allows for calculating summary statistics for groups of data within your data set. Examples include averages for each station, seasonal and annual statistics, or other combinations that you can imagine. Here is a simple example using these two functions to calculate overall average chlorophyll-a values for each of the two stations (D7 and P8) in the EMP water quality data set.

# Group by Station
df_wq %>% group_by(Station) %>% summarize(mean_Chla = mean(Chla))

# A tibble: 2 × 2
  Station mean_Chla
  <chr>       <dbl>
1 D7           2.71
2 P8           3.37

You’ll see that now we have an additional column added to the tibble for “Station”, and the “mean_Chla” column contains the average values for each station. As with the example in the Basic Summarizing section above, you can calculate multiple values at once for each group by providing additional arguments to summarize.

# Calculate more than one summary statistic within `summarize`
df_wq %>% 
  group_by(Station) %>% 
  summarize(
    min_Chla = min(Chla),
    mean_Chla = mean(Chla),
    median_Chla = median(Chla),
    max_Chla = max(Chla),
    sd_Chla = sd(Chla)
  )

# A tibble: 2 × 6
  Station min_Chla mean_Chla median_Chla max_Chla sd_Chla
  <chr>      <dbl>     <dbl>       <dbl>    <dbl>   <dbl>
1 D7          0.5       2.71        2.21     7.36    1.81
2 P8          0.57      3.37        2.67    16.5     2.92

You can also group by more than one variable at a time. For example, the EMP water quality data set contains data from multiple years (2020-2022). We can calculate the same series of summary statistics for chlorophyll-a in the prior example for each station and year combination.

# First, we add a Year column to use as a second grouping variable, creating a
# new object for the resulting data frame (year() is from the lubridate package)
df_wq_c <- df_wq %>% mutate(Year = year(Date))

# Next, we calculate summary statistics for Chla for each station and year
  # combination
df_wq_c %>% 
  group_by(Station, Year) %>% 
  summarize(
    min_Chla = min(Chla),
    mean_Chla = mean(Chla),
    median_Chla = median(Chla),
    max_Chla = max(Chla),
    sd_Chla = sd(Chla)
  )

# A tibble: 6 × 7
# Groups:   Station [2]
  Station  Year min_Chla mean_Chla median_Chla max_Chla sd_Chla
  <chr>   <dbl>    <dbl>     <dbl>       <dbl>    <dbl>   <dbl>
1 D7       2020     0.5       1.53        1.74     2.79   0.749
2 D7       2021     1.17      3.58        3.12     6.76   1.97 
3 D7       2022     0.67      2.88        2.51     7.36   1.87 
4 P8       2020     0.64      4.48        2.81    16.5    4.88 
5 P8       2021     1.25      3.18        3.24     5.56   1.57 
6 P8       2022     0.57      2.70        2.58     5.24   1.54

Wow, now we are starting to get somewhere with summarizing our data set. You’ll see that the tibble now has a “Year” column in addition to “Station”, followed by the desired summary statistics for each station and year combination. You may also have noticed that the printout of the tibble indicates that it is still grouped by “Station”. This is because, by default, summarize drops only the last level of grouping (“Year”) from its output. It is good practice to ungroup a data frame when you no longer need it to be grouped, because leftover grouping can produce unintended results when you use other functions on it. You can ungroup the data frame with the ungroup function from the dplyr package.

# It is best practice to ungroup the data once the operation is finished
df_wq_c %>% 
  group_by(Station, Year) %>% 
  summarize(
    min_Chla = min(Chla),
    mean_Chla = mean(Chla),
    median_Chla = median(Chla),
    max_Chla = max(Chla),
    sd_Chla = sd(Chla)
  ) %>% 
  ungroup()

# A tibble: 6 × 7
  Station  Year min_Chla mean_Chla median_Chla max_Chla sd_Chla
  <chr>   <dbl>    <dbl>     <dbl>       <dbl>    <dbl>   <dbl>
1 D7       2020     0.5       1.53        1.74     2.79   0.749
2 D7       2021     1.17      3.58        3.12     6.76   1.97 
3 D7       2022     0.67      2.88        2.51     7.36   1.87 
4 P8       2020     0.64      4.48        2.81    16.5    4.88 
5 P8       2021     1.25      3.18        3.24     5.56   1.57 
6 P8       2022     0.57      2.70        2.58     5.24   1.54

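Now, you see that the output data frame is no longer grouped. A useful trick is to use the .by argument within summarize (available in dplyr 1.1.0 and later) to temporarily group the data frame just for the summarize operation. Note that .by keeps the groups in the order they first appear in the data rather than sorting them as group_by does, so the rows in the output below are in a different order.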

# It's possible to group data within `summarize`
df_wq_c %>% 
  summarize(
    min_Chla = min(Chla),
    mean_Chla = mean(Chla),
    median_Chla = median(Chla),
    max_Chla = max(Chla),
    sd_Chla = sd(Chla),
    .by = c(Station, Year)
  )

# A tibble: 6 × 7
  Station  Year min_Chla mean_Chla median_Chla max_Chla sd_Chla
  <chr>   <dbl>    <dbl>     <dbl>       <dbl>    <dbl>   <dbl>
1 P8       2020     0.64      4.48        2.81    16.5    4.88 
2 D7       2020     0.5       1.53        1.74     2.79   0.749
3 P8       2021     1.25      3.18        3.24     5.56   1.57 
4 D7       2021     1.17      3.58        3.12     6.76   1.97 
5 D7       2022     0.67      2.88        2.51     7.36   1.87 
6 P8       2022     0.57      2.70        2.58     5.24   1.54
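Another option for avoiding a grouped result is the .groups argument of summarize. The sketch below uses a small made-up data frame (df_toy and its values are hypothetical) and sets .groups = "drop" so that the output comes back ungrouped without a separate call to ungroup. The is_grouped_df function from dplyr is used only to confirm the result.

```r
library(dplyr)

# Hypothetical toy data with two grouping variables
df_toy <- tibble(
  Station = c("A", "A", "B", "B"),
  Year = c(2020, 2021, 2020, 2021),
  Chla = c(1.2, 2.4, 3.1, 0.8)
)

# .groups = "drop" removes all grouping from the summarized output
df_toy_summary <- df_toy %>%
  group_by(Station, Year) %>%
  summarize(mean_Chla = mean(Chla), .groups = "drop")

# Confirm that the result is not grouped
is_grouped_df(df_toy_summary)  # FALSE
```

Which approach to use (ungroup, .by, or .groups) is mostly a matter of preference; all three leave you with an ungrouped data frame.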

We won’t cover it here, but the group_by function also works with other functions in the dplyr package including mutate, filter, and arrange. There are also many other useful things you can do with the summarize function that we won’t cover in this class including using it with the across function. The across function allows for the summarize and mutate functions to apply operations across multiple columns in a data frame. Using it in combination with tidyselect functions allows for much more efficient code.
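To give a flavor of what across can do, here is a minimal sketch on a made-up data frame (df_toy and its values are hypothetical). It calculates the mean of two columns in a single across call instead of writing one argument per column; the .names argument controls how the output columns are named.

```r
library(dplyr)

# Hypothetical toy data with two numeric columns to summarize
df_toy <- tibble(
  Station = c("A", "A", "B", "B"),
  Chla = c(1.0, 3.0, 2.0, 6.0),
  Pheophytin = c(0.5, 1.5, 1.0, 2.0)
)

# Apply mean() to both columns at once for each station
df_toy_means <- df_toy %>%
  group_by(Station) %>%
  summarize(across(c(Chla, Pheophytin), mean, .names = "mean_{.col}"))

df_toy_means
```

The result has one row per station with the columns mean_Chla and mean_Pheophytin.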

Summarize vs. Mutate

You may be wondering how summarize and mutate are different since they both do similar things. The main difference is that mutate always returns the same number of rows as the input data frame, while summarize without grouping collapses the data to a single row containing the specified summary statistic(s). When used with group_by, summarize returns one row for each combination of grouping variables.
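The difference is easiest to see by comparing row counts. This is a minimal sketch on a made-up data frame (df_toy and its values are hypothetical) that applies the same grouped mean with each function.

```r
library(dplyr)

# Hypothetical toy data: 5 rows across 2 stations
df_toy <- tibble(
  Station = c("A", "A", "B", "B", "B"),
  Chla = c(1.0, 3.0, 2.0, 4.0, 6.0)
)

# mutate keeps all 5 rows and repeats each station's mean on every row
df_toy_mutate <- df_toy %>%
  group_by(Station) %>%
  mutate(mean_Chla = mean(Chla)) %>%
  ungroup()

# summarize collapses each station to a single row
df_toy_summarize <- df_toy %>%
  group_by(Station) %>%
  summarize(mean_Chla = mean(Chla))

nrow(df_toy_mutate)     # 5
nrow(df_toy_summarize)  # 2
```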

7.3 Pivoting

Let’s look at our column headers again:

colnames(df_wq)

 [1] "Station"            "Date"               "Chla"              
 [4] "Pheophytin"         "TotAlkalinity"      "DissAmmonia"       
 [7] "DissNitrateNitrite" "DOC"                "TOC"               
[10] "DON"                "TotPhos"            "DissOrthophos"     
[13] "TDS"                "TSS"                "TKN"               
[16] "Depth"              "Secchi"             "Microcystis"       
[19] "SpCndSurface"       "WTSurface"

We note the structure here: we have metadata for Station and Date, and then each of the 18 analytes has its own column. Therefore, our dimensions are 62 rows x 20 columns. Sometimes, however, it’s more advantageous to structure the data in a different way. For example, maybe we want all analyte names in one column with their values in another.

We can achieve this by pivoting the rows and columns. The tidyr package (part of the tidyverse) has two handy functions for this: pivot_longer and pivot_wider.

7.3.1 `pivot_longer`

When we gather multiple columns into two columns (one for names, another for values), we make the dataset longer; hence, pivot_longer. For this function, you need to specify:

which columns to include (cols)
what to call the name column (names_to)
what to call the value column (values_to)

Let’s use this to pivot all the analytes into two columns. I could achieve this by writing out all the column names. However, a handy shortcut is to use the - operator, which works as a negation: it tells the code to consider all columns except the ones listed:

df_wq_long <- df_wq %>%
  pivot_longer(
    cols = -c(Station, Date),
    names_to = 'Analyte',
    values_to = 'Value'
  )

df_wq_long

# A tibble: 1,116 × 4
   Station Date       Analyte            Value
   <chr>   <date>     <chr>              <dbl>
 1 P8      2020-01-16 Chla                0.64
 2 P8      2020-01-16 Pheophytin          0.5 
 3 P8      2020-01-16 TotAlkalinity      98   
 4 P8      2020-01-16 DissAmmonia         0.15
 5 P8      2020-01-16 DissNitrateNitrite  2.8 
 6 P8      2020-01-16 DOC                 3.9 
 7 P8      2020-01-16 TOC                 4.1 
 8 P8      2020-01-16 DON                NA   
 9 P8      2020-01-16 TotPhos             0.31
10 P8      2020-01-16 DissOrthophos       0.2 
# ℹ 1,106 more rows

We can see that our data shape has changed; we have fewer columns and many more rows:

dim(df_wq_long)

[1] 1116    4
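The same change in shape can be reproduced on a small data frame where the arithmetic is easy to follow by hand. This is a minimal sketch with made-up data (df_toy and its values are hypothetical): 3 rows with 2 analyte columns become 3 × 2 = 6 rows after pivoting.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 2 identifying columns and 2 analyte columns
df_toy <- tibble(
  Station = c("A", "B", "A"),
  Date = as.Date(c("2020-01-01", "2020-01-02", "2020-02-01")),
  Chla = c(1.1, 2.2, 3.3),
  DOC = c(4.4, NA, 6.6)
)

# Pivot every column except Station and Date into name/value pairs
df_toy_long <- df_toy %>%
  pivot_longer(
    cols = -c(Station, Date),
    names_to = "Analyte",
    values_to = "Value"
  )

dim(df_toy_long)  # 6 rows, 4 columns
```

Note that the NA in DOC is kept as its own row by default, so missing values do not change the row count.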

We see that we went from 20 columns to 4. However, we also went from 62 rows to 1116. Why 1116? The columns we did not pivot (Station and Date) were used to help create a key; every row is a unique combination of Station, Date, and our new name column Analyte. There are 62 unique Station/Date combinations, and each of the 18 analytes has a row for each one, giving 62 × 18 = 1116 rows.

7.3.2 `pivot_wider`

Let’s pretend this was the original format we received the data in, and we would rather have each analyte be its own column. In other words, our goal is to spread two columns into more than two columns, making the dataset wider. The best function for this, as you might suspect, is pivot_wider. Here, instead of pivoting to a names column and a values column, we pivot from them, and need to specify:

which column to take the names from (names_from)
which column to take the values from (values_from)

df_wq_wide <- df_wq_long %>%
  pivot_wider(
    names_from = Analyte,
    values_from = Value
  )

df_wq_wide

# A tibble: 62 × 20
   Station Date        Chla Pheophytin TotAlkalinity DissAmmonia
   <chr>   <date>     <dbl>      <dbl>         <dbl>       <dbl>
 1 P8      2020-01-16  0.64       0.5             98        0.15
 2 D7      2020-01-22  0.67       0.87            82        0.21
 3 P8      2020-02-14  1.46       0.69            81        0.25
 4 D7      2020-02-20  2.15       0.5             86        0.14
 5 P8      2020-03-03  1.4        0.56            80        0.11
 6 D7      2020-03-06  1.89       1.13            93        0.22
 7 P8      2020-06-11  4.73       1.25            59        0.05
 8 D7      2020-06-17  1.74       0.89            78        0.05
 9 P8      2020-07-13  6.4        0.88            63        0.05
10 D7      2020-07-16  2.79       0.85            80        0.05
# ℹ 52 more rows
# ℹ 14 more variables: DissNitrateNitrite <dbl>, DOC <dbl>, TOC <dbl>,
#   DON <dbl>, TotPhos <dbl>, DissOrthophos <dbl>, TDS <dbl>, TSS <dbl>,
#   TKN <dbl>, Depth <dbl>, Secchi <dbl>, Microcystis <dbl>,
#   SpCndSurface <dbl>, WTSurface <dbl>

dim(df_wq_wide)

[1] 62 20

Now we’re back to our original dimensions!

Note: The pivot functions are very powerful and have a large amount of flexibility. For example, you can specify multiple value columns or combine columns to create a single names column. The help documentation has useful examples to reference for cases like these.
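As one illustration of this flexibility, pivot_wider accepts more than one column in names_from and joins their values to build the new column names. This is a minimal sketch on made-up data (df_toy_long and its values are hypothetical), and it shows only one of the options described in the help documentation.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data with a Year column
df_toy_long <- tibble(
  Station = c("A", "A", "B", "B"),
  Year = c(2020, 2021, 2020, 2021),
  Analyte = "Chla",
  Value = c(1.1, 2.2, 3.3, 4.4)
)

# Build column names from both Analyte and Year (joined with "_" by default)
df_toy_wide <- df_toy_long %>%
  pivot_wider(
    names_from = c(Analyte, Year),
    values_from = Value
  )

names(df_toy_wide)  # "Station" "Chla_2020" "Chla_2021"
```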

7.3.3 Exercise

Now it’s your turn to try out grouping, summarizing, and pivoting.

Use group_by and summarize to calculate the minimum, median, and maximum values for “Secchi” for each station.
HINT: Watch out for NA values!
Add “Year” as a grouping variable to the operation above to calculate summary statistics for “Secchi” for each station and year combination. Assign the result to its own object named df_summary.
HINT: Don’t forget to ungroup your output data frame!
Pivot df_summary so that the Secchi statistics are rows and the years are columns.
HINT: You will use both pivot_longer and pivot_wider.