11  Final Course Exercise

Author

Dave Bosworth

In this final course exercise, we will combine many of the concepts we’ve learned in this class together to explore a couple new data sets. The first data set we’ll be using contains continuous water quality data for various parameters collected at the Toe Drain at Lisbon Weir (LIS) site from 2014-2019. This data file contains measurements collected every 15 minutes with a YSI multi-parameter sonde. The data file “rtm_wq_lis_2014_2018.rds” has the following columns:

  1. Start with importing the continuous water quality data for LIS. The file should be at the following path in your R project: “data/rtm_wq_lis_2014_2018.rds”
A note about .rds and .RData files

.rds and .RData files are file extensions specific to the R programming language. Both file types offer compression to reduce storage space on your computer.

.rds files store a single R object, typically a data frame, while .RData files contain multiple R objects.

In this exercise, we are importing a .rds file. You can use either the readRDS or read_rds functions to import it into your R session.

  1. After importing the LIS data, take a look at its structure to make sure all data read in properly.

  2. Notice that the DateTime column was imported as a character data class. It will need to be converted to a date-time class in order to proceed with the rest of this exercise. Do so now.

  3. After cleaning up the DateTime column in your data set, create a time-series plot of one of the water quality parameters collected at LIS.

  4. It looks like we have data from 2014-2018. We are only interested in the 2015-2016 data, so create a new data frame with just the LIS data for those two years.

  5. Now, re-create the time-series plot you made above for years 2015-2016.

  6. To check the 2015-2016 data for outliers or erroneous values, plot all continuous water quality parameters (WaterTemp, Turbidity, SpCnd, pH, DO, Chla) in time-series plots using facet_wrap to put each parameter in its own subplot.
    HINT: You will need to pivot the data frame to the long format before plotting the data in facets. The long data frame will have three columns: DateTime, a column identifying the water quality parameter, and another for the values. Also, before pivoting the data, you will need to remove the columns ending with _Qual.

  7. Some of the parameters are difficult to see clearly, so re-create the plot above giving each facet a different y-axis range.

  8. Boxplots are also nice to use to look for potential outliers in the data, so make a box and whisker plot of the 2015-2016 data using year as the categorical variable and using facet_wrap to put each parameter in its own subplot. Give each facet a different y-axis range.
    HINT: You will need to use the data frame in the long format to create this plot. Also, be sure to convert year to a factor to plot it as a categorical variable.

  9. The box and whisker plots show that there may be some possible outliers in some of the parameters, but for the most part, the data looks pretty clean in the time-series plots, so we will use it as is. It turns out that you are only interested in the chlorophyll (Chla) data for 2015-2016. Averaging this data will help smooth out any potential outliers in the data. Calculate the daily average chlorophyll values for 2015-2016 in a new data frame object to be used later.
    HINT: Use the date function from the lubridate package to extract the date from the DateTime column. Also, watch out for NA values in Chla.

  10. Now, make a time-series plot of the 2015-2016 daily average chlorophyll data. Assign this plot to a new object so it can be used later.

  11. This plot is interesting, let’s see how daily average chlorophyll values from a nearby station in the Sacramento River (SRH) compares. Import the 2014-2018 SRH daily average chlorophyll data. The data file should be at the following path in your R project: “data/dv_chla_srh_2014_2018.csv”. This data file has the following columns:

    • Station - Identifies station where data was collected. In this case “SRH”.
    • Date - Date of measurement
    • Chla_mean - Daily average chlorophyll-a fluorescence in ug/L
  12. Again, we are only interested in the 2015-2016 data from SRH, so create a new data frame with just the 2015-2016 data.

  13. Bind the daily average chlorophyll data from SRH to the daily averages from LIS so they can be plotted together.
    HINT: Watch out for possibly different column names. Also, the LIS data needs a new column for Station so it can be differentiated from SRH.

  14. Now, create time-series plots of the daily average chlorophyll values for LIS and SRH on the same plot with different colors for the stations.

  15. It turns out your supervisor really likes this plot, but would like to put it in a report, so they want it to have some nicer formatting. Remake this plot with the following customizations:

    • Make each station have a different point shape. You can choose from any of these options.
    • Change the size of the points to 2 for both stations.
    • Change the color of the time-series points and lines so LIS is a shade of blue and SRH is a shade of orange. You can use this color chart to help you choose colors.
    • Change the plot theme to theme_bw.
    • Give the plot an informative title and descriptive axis labels.

    Be sure to assign this plot to a new object so it can be exported as a png file later.

  16. Export the figure for your supervisor’s report as a png file with a width of 6 inches and height of 4 inches and a dpi of 300. You also want the 2015-2016 daily average chlorophyll data for LIS and SRH, so export that as a csv file.

  17. After thinking about the results more, you notice that LIS varies throughout the summer months in 2016. Could this have something to do with water flow at LIS? It turns out we in luck because water flow is also collected at LIS. Import 2014-2018 daily average water flow (cfs) collected at LIS. The data file should be at the following path in your R project: “data/dv_flow_lis_2014_2018.csv”. This data file has the following columns:

    • Date - Date of measurement
    • Flow - Daily average water flow in cfs
  18. Again, we are only interested in the 2015-2016 flow data from LIS, so create a new data frame with just the 2015-2016 data.

  19. Create a time-series plot of the 2015-2016 daily average flow data for LIS. Assign this plot to a new object so it can be used later.

  20. Combine the 2015-2016 chlorophyll and water flow plots for LIS into one plot arranged vertically using the patchwork package to take a look at how chlorophyll and flow vary across time.

  21. There does seem to be some relationship between chlorophyll and flow particularly during 2016 at LIS. We’ll use a scatterplot of chlorophyll and flow data to take a look at this possible relationship in a different way. To accomplish this, first you will need to join the daily average flow data for LIS to the daily average chlorophyll data and filter it to just 2016.

  22. Next, create the scatterplot of the 2016 LIS data with daily average flow on the x-axis and daily average chlorophyll on the y-axis.

  23. Bonus: Add a trend line to this scatterplot using a linear model (LM).

If you have some extra time, explore other aspects of the data on your own.

Click below for the answer when you are done!

Code
# Load packages
library(tidyverse)
library(patchwork)
library(here)

# 1) Import LIS water quality data
df_wq_lis <- readRDS(here("data/rtm_wq_lis_2014_2018.rds"))

# 2) Take a look at the structure of the imported data frame
glimpse(df_wq_lis)

# 3) Convert DateTime column from character to date-time class
df_wq_lis_c <- df_wq_lis %>% mutate(DateTime = ymd_hms(DateTime, tz = "Etc/GMT+8"))

# 4) Create a time-series plot of the water temperature data
df_wq_lis_c %>% 
  ggplot(aes(x = DateTime, y = WaterTemp)) +
  geom_point()

# 5) Create a new data frame for LIS water quality data for just 2015-2016
df_wq_lis_1516 <- df_wq_lis_c %>% filter(year(DateTime) %in% 2015:2016)

# 6) Re-create the time-series plot of water temperature for years 2015-2016
df_wq_lis_1516 %>% 
  ggplot(aes(x = DateTime, y = WaterTemp)) +
  geom_point()

# 7) Plot all continuous water quality parameters in time-series plots using
  # facet_wrap to put each parameter in its own subplot.

# Pivot the data frame to the long format before plotting the data
lis_wq_param <- c("WaterTemp", "Turbidity", "SpCnd", "DO", "pH", "Chla")

df_wq_lis_1516_l <- df_wq_lis_1516 %>% 
  select(DateTime, all_of(lis_wq_param)) %>% 
  pivot_longer(
    cols = all_of(lis_wq_param),
    names_to = "Parameter",
    values_to = "Value"
  ) %>% 
  drop_na(Value)

# Create plot
df_wq_lis_1516_l %>% 
  ggplot(aes(x = DateTime, y = Value)) +
  geom_point() +
  facet_wrap(vars(Parameter))

# 8) Re-create the plot above giving each facet a different y-axis range
df_wq_lis_1516_l %>% 
  ggplot(aes(x = DateTime, y = Value)) +
  geom_point() +
  facet_wrap(vars(Parameter), scales = "free_y")

# 9) Create a box and whisker plot of the 2015-2016 data using year as the
  # categorical variable and using facet_wrap to put each parameter in its own
  # subplot. Give each facet a different y-axis range.
df_wq_lis_1516_l %>% 
  # Create a categorical (factor) column for Year
  mutate(Year = factor(year(DateTime))) %>% 
  ggplot(aes(x = Year, y = Value)) +
  geom_boxplot() +
  facet_wrap(vars(Parameter), scales = "free_y")

# 10) Calculate daily average chlorophyll values for 2015-2016 in a new data
  # frame object to be used later
df_chla_lis_1516 <- df_wq_lis_1516 %>% 
  # Create a column for Date
  mutate(Date = date(DateTime)) %>% 
  group_by(Date) %>% 
  # Use na.rm = TRUE to ignore NA values in Chla
  summarize(Chla = mean(Chla, na.rm = TRUE))

# 11) Create a time-series plot of the 2015-2016 daily average chlorophyll data.
  # Assign this plot to a new object so it can be used later
plt_chla_lis_1516 <- df_chla_lis_1516 %>% 
  ggplot(aes(x = Date, y = Chla)) +
  geom_point() +
  geom_line()

# Print out plot
plt_chla_lis_1516

# 12) Import SRH daily average chlorophyll data
df_chla_srh <- read_csv(here("data/dv_chla_srh_2014_2018.csv"))

# 13) Create a new data frame for SRH daily average chlorophyll data for just 2015-2016
df_chla_srh_1516 <- df_chla_srh %>% filter(year(Date) %in% 2015:2016)

# 14) Bind the daily average chlorophyll data from SRH to the daily averages from LIS

# Rename Chla column in SRH data frame so it matches the LIS data frame
df_chla_srh_1516_c <- df_chla_srh_1516 %>% rename(Chla = Chla_mean)

df_chla_1516 <- df_chla_lis_1516 %>% 
  # Create a new column for Station in the LIS data frame
  mutate(Station = "LIS", .before = Date) %>% 
  bind_rows(df_chla_srh_1516_c)

# 15) Create time-series plots of the daily average chlorophyll values for LIS
  # and SRH on the same plot with different colors for the stations
df_chla_1516 %>% 
  ggplot(aes(x = Date, y = Chla, color = Station)) +
  geom_point() +
  geom_line()

# 16) Remake the plot above with the specified customizations. Assign this plot
  # to a new object so it can be exported as a png file later
plt_chla_1516 <- df_chla_1516 %>% 
  ggplot(aes(x = Date, y = Chla, color = Station)) +
  geom_point(aes(shape = Station), size = 2) +
  geom_line() +
  theme_bw() +
  scale_color_manual(values = c(LIS = "midnightblue", SRH = "darkorange3")) +
  scale_shape_manual(values = c(15, 17)) +
  labs(
    title = "Chlorophyll Comparision between LIS and SRH in 2015-2016",
    y = "Daily Average Chlorophyll (ug/L)"
  )

# Print out plot
plt_chla_1516

# 17) Export the figure for your supervisor's report as a png file. Also export
  # the 2015-2016 daily average chlorophyll data for LIS and SHR as a csv file.

# Export plot
ggsave(
  plot = plt_chla_1516,
  filename = "chla_lis_srh_comp.png",
  width = 6, 
  height = 4, 
  units = "in",
  dpi = 300
)

# Export daily average chlorophyll data
df_chla_1516 %>% write_csv("dv_chla_lis_srh_2015_2016.csv")

# 18) Import 2014-2018 daily average water flow (cfs) collected at LIS
df_q_lis <- read_csv(here("data/dv_flow_lis_2014_2018.csv"))

# 19) Create a new data frame for LIS daily average water flow for just 2015-2016
df_q_lis_1516 <- df_q_lis %>% filter(year(Date) %in% 2015:2016)

# 20) Create a time-series plot of the 2015-2016 daily average flow data for
  # LIS. Assign this plot to a new object so it can be used later
plt_q_lis_1516 <- df_q_lis_1516 %>% 
  ggplot(aes(x = Date, y = Flow)) + 
  geom_point() +
  geom_line()

# Print out plot
plt_q_lis_1516

# 21) Combine the 2015-2016 chlorophyll and water flow plots for LIS into one
  # plot arranged vertically using the patchwork package
plt_chla_lis_1516 / plt_q_lis_1516

# 22) Join the daily average flow data for LIS to the daily average chlorophyll
  # data and filter it to just 2016
df_chla_q_lis_16 <- df_chla_lis_1516 %>% 
  # Join flow data to chlorophyll data
  left_join(df_q_lis_1516, by = join_by(Date)) %>% 
  # Filter to just 2016
  filter(year(Date) == 2016) %>% 
  # Remove rows with NA values in Flow column
  drop_na(Flow)

# 23) Create a scatterplot of the 2016 LIS data with daily average flow on the
  # x-axis and daily average chlorophyll on the y-axis
df_chla_q_lis_16 %>% 
  ggplot(aes(x = Flow, y = Chla)) +
  geom_point()

# 24) Add a trend line to the scatterplot above using a linear model (LM)
df_chla_q_lis_16 %>% 
  ggplot(aes(x = Flow, y = Chla)) +
  geom_point() +
  geom_smooth(method = lm, formula = "y ~ x")