Author

Perry

Published

December 29, 2025

Analysis Example

This is a quick example of SEM analysis, covering data exploration + model creation + model interpretation.

My models will be:

\[ \text{log(Chlorophyll)} \sim \text{log(DissAmmonia)} + \text{...} + \text{RE(Region)} \]

\[ \text{log(DissAmmonia)} \sim \text{...} + \text{RE(Region)} \]

I used the monthly nutrient data for this.

Code
library(tidyverse)
library(emmeans)
library(car)
library(lme4)
library(lmerTest)
library(performance)
library(piecewiseSEM)
library(lavaan)
library(semPlot)
library(mgcv)
library(here)
library(conflicted)
theme_set(theme_bw())
options(contrasts = c('contr.sum', 'contr.poly'))

# Declare package conflict preferences
conflicts_prefer(dplyr::filter(), dplyr::lag(), lme4::lmer())
Code
# Define WQ parameters included in this example
wq_param <- c(
  "DissAmmonia",
  "DissNitrate",
  "DissNitrateNitrite",
  "DissOrthophos",
  "DON",
  "TKN",
  "TotPhos",
  "Temperature",
  "Salinity",
  "Turbidity",
  "DissolvedOxygen",
  "Chlorophyll",
  "pH",
  "Microcystis",
  "Sac_Inflow",
  "SJR_Inflow",
  "Total_Inflow"
)

df_monthly <- readRDS(here("data/processed/monthly_values.rds")) %>%
  mutate(Region = factor(Region), Date = make_date(Year, Month, 1)) %>%
  select(MonthYear, Year, Month, Date, Region, all_of(wq_param)) %>%
  arrange(Region, Date)

df_monthly_long <- df_monthly %>%
  pivot_longer(
    cols = all_of(wq_param),
    names_to = "Variable",
    values_to = "Value"
  )

Exploratory Analysis (Data)

First we investigate our individual variables.

Balance check

When you have categorical variables (like Region), it’s best to have balanced data – the same number per category.

How many samples do we have per Region?

Code
df_monthly %>%
  count(Region)
Region n
Confluence 174
North 174
SouthCentral 174
Suisun Bay 174
Suisun Marsh 174

Basically even, but how many are non-NAs for relevant variables?

Code
df_monthly_long %>%
  drop_na(Value) %>%
  count(Region, Variable) %>%
  pivot_wider(names_from = Region, values_from = n, values_fill = 0)
Variable Confluence North SouthCentral Suisun Bay Suisun Marsh
Chlorophyll 174 173 174 173 167
DON 169 170 165 166 80
DissAmmonia 170 172 173 169 83
DissNitrate 114 168 131 0 0
DissNitrateNitrite 170 170 173 169 83
DissOrthophos 170 170 173 169 83
DissolvedOxygen 174 174 174 174 174
Microcystis 143 138 144 130 130
SJR_Inflow 174 174 174 174 174
Sac_Inflow 174 174 174 174 174
Salinity 174 174 174 174 174
TKN 170 170 173 169 83
Temperature 174 174 174 174 174
TotPhos 170 170 173 169 83
Total_Inflow 174 174 174 174 174
Turbidity 174 174 174 174 174
pH 172 174 174 153 152

There are missing values for certain analytes. To group them:

balanced: Temperature, Salinity, Turbidity, DissolvedOxygen, Inflows

relatively balanced (>= 75%): Chlorophyll, pH, MVI

missing data (Suisun Marsh): DissAmmonia, DissNitrateNitrite, DissOrthophos, DON, TKN, TotPhos

missing a lot of data: DissNitrate

We can break this down by month to see if the missing data is interpolable or not:

Code
df_missing <- df_monthly_long %>%
  mutate(
    Missing = factor(
      if_else(is.na(Value), 1, 0),
      levels = c(0, 1),
      labels = c('Present', 'Missing')
    )
  )

plots <- df_missing %>%
  group_by(Variable) %>%
  group_split() %>%
  lapply(function(df) {
    ggplot(df, aes(x = Date, y = Region, fill = Missing)) +
      geom_tile(color = 'grey20', linewidth = 0.3, width = 35) +
      scale_fill_manual(
        values = c('Present' = 'lightblue', 'Missing' = 'white')
      ) +
      scale_x_date(
        name = "Month",
        date_breaks = '1 year',
        date_labels = '%Y',
        expand = expansion(mult = 0)
      ) +
      scale_y_discrete(expand = expansion(add = 0)) +
      theme_bw() +
      labs(
        title = paste0('Missing Data: ', unique(df$Variable)),
        fill = 'P/A'
      ) +
      theme(
        axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        legend.position = "top",
        panel.grid = element_blank()
      )
  })

walk(plots, print)

Findings:

No data until 02-2017 (for Suisun Marsh): DON, DissAmmonia, DissNitrateNitrite, DissOrthophos, TKN, TotPhos

  • rest of missing data is interpolable

Lots of missing data: DissNitrate

Large gaps until 06-2015: MVI

  • all 5 month gaps between Jan-May

Missing after 01-2023 for Suisun Marsh/Suisun Bay: pH

Small interpolable gaps: Chlorophyll

No gaps: Temperature, Salinity, Turbidity, DissolvedOxygen, Inflows

Discussion

A few analytes can be used for the entire 01-2010 through 06-2024 time range:

  • Fully ready: Temperature, Salinity, Turbidity, DissolvedOxygen, Inflows

  • Possibly needs minor interpolation: Chlorophyll

A few could be used for the entire range (with minor interpolation) for all regions except Suisun Marsh. The options are:

  • cut the analysis range to begin on 02-2017

  • remove Suisun Marsh from primary analyses, keep full range

    • could run separately on narrower time frame
  • analyze with full dataset, interpret with unbalanced design in mind

pH is missing data after 01-2023 for two regions; this feels like an error to me, so we might want to double check the original data. If not, the options are similar to above:

  • cut the analysis range to end at 01-2023

  • remove Suisun Marsh/Suisun Bay from primary analyses

  • analyze with full dataset, interpret with unbalanced design in mind

MVI has multiple data gaps and some large, non-interpolable gaps. Options are similar to above. We also have chlorophyll data as a response variable to ground truth our results if analyses look off with MVI.

DissNitrate is missing large amounts of data and should be excluded.

Visualize distributions

Next, I’m removing columns I won’t be using for simplicity.

Code
df_monthly_rm <- df_monthly %>%
  select(-c(DissNitrate, DON, Microcystis, SJR_Inflow, Sac_Inflow, TKN, TotPhos))

Now I look at the shape of variable distributions to get a sense of what the data looks like.

Code
df_plt <- df_monthly_rm %>%
  pivot_longer(
    cols = DissAmmonia:Total_Inflow,
    names_to = 'Analyte',
    values_to = 'Value'
  )

ggplot(df_plt, aes(x = Value)) +
  geom_histogram(
    color = 'black',
    fill = 'lightgray'
  ) +
  facet_wrap(~Analyte, scales = 'free')

They’re pretty much all right-skewed to various degrees, except for Temperature (which almost appears bivariate) and pH, which is slightly left skewed. A few graphs, such as chlorophyll and pH, indicate outliers exist.

Visualize time series

If your data has a time component, it’s also useful to look at a time series of the analytes.

Code
# no points
for (a in unique(df_plt$Analyte)) {
  p <- df_plt %>%
    filter(Analyte == a) %>%
    ggplot(aes(x = Date, y = Value)) +
    geom_line(color = 'gray50', linewidth = 0.8, na.rm = TRUE) +
    facet_wrap(~Region, scales = 'free_y', ncol = 2) +
    labs(title = a)

  print(p)
}

Code
# with points
for (a in unique(df_plt$Analyte)) {
  p <- df_plt %>%
    filter(Analyte == a) %>%
    ggplot(aes(x = Date, y = Value)) +
    geom_line(color = 'gray50', linewidth = 0.8, na.rm = TRUE) +
    geom_point(shape = 21, na.rm = TRUE) +
    facet_wrap(~Region, scales = 'free_y', ncol = 2) +
    labs(title = a)

  print(p)
}

These are especially useful for identifying seasonal patterns within the data.

Visualize outliers

Next, I like to note extreme outliers. Even if they’re real data, they can negatively influence regressions either by influencing the coefficients or inflating the error.

If later on (eg. after model diagnostics) I decide to remove any of these, this is where I’d do it.

Code
df_plt <- df_plt %>%
  group_by(Analyte) %>%
  mutate(
    Q1  = quantile(Value, 0.25, na.rm = TRUE),
    Q3  = quantile(Value, 0.75, na.rm = TRUE),
    IQR = Q3 - Q1,
    is_outlier = Value < (Q1 - 1.5 * IQR) | Value > (Q3 + 1.5 * IQR)
  ) %>%
  ungroup()

ggplot(df_plt, aes(x = 1, y = Value)) +
  geom_jitter(
    aes(color = ifelse(is_outlier, 'Outlier', 'Normal')),
    shape = 21,
    width = 0.1,
    na.rm = TRUE
  ) +
  scale_color_manual(
    values = c('Normal' = 'lightgray', 'Outlier' = 'red')
  ) +
  geom_boxplot(
    outlier.shape = NA,
    na.rm = TRUE
  ) +
  facet_wrap(~Analyte, scales = 'free_y') +
  theme(legend.position = 'none')

There’s a few outliers, though none I would remove outright. However, further on I decided I wanted to remove a few extreme DissAmmonia, pH, Chlorophyll, and DissOrthophos values (7 total), which I’ll do here:

Code
df_monthly_rm <- df_monthly_rm %>%
  filter(!(DissAmmonia > 0.4 | pH < 6.5 | Chlorophyll > 35 | DissOrthophos < 0.01) |
           is.na(DissAmmonia) |
           is.na(pH) |
           is.na(Chlorophyll) |
           is.na(DissOrthophos))

Impute NAs (optional)

Next we’ll impute missing data using a linear approximation. If methods don’t require a continuous, equally-spaced time series, this step is optional. However, I like doing it (when feasible) to offset unbalanced design issues.

If your data has non-detects, make sure to uniformly impute those values between (slightly larger than) 0 and the reporting limit.

Code
df_monthly_imputed <- df_monthly_rm %>%
  arrange(Region, Date) %>%
  group_by(Region) %>%
  mutate(
    across(
      DissAmmonia:Total_Inflow,
      ~ is.na(.x),
      .names = 'imp_{.col}'
    ),
    across(
      DissAmmonia:Total_Inflow,
      ~ zoo::na.approx(.x, na.rm = FALSE),
      .names = '{.col}'
    )
  ) %>%
  ungroup()

values_long <- df_monthly_imputed %>%
  pivot_longer(
    cols = DissAmmonia:Total_Inflow,
    names_to = 'variable',
    values_to = 'value'
  )

flags_long <- df_monthly_imputed %>%
  pivot_longer(
    cols = starts_with('imp_'),
    names_to = 'flag_var',
    values_to = 'imp'
  ) %>%
  mutate(variable = sub('imp_', '', flag_var)) %>%
  select(Date, Region, variable, imp)

df_plot <- values_long %>%
  left_join(flags_long, by = c('Date', 'Region', 'variable'))

vars_imputed <- df_plot %>%
  group_by(variable) %>%
  summarise(any_imp = any(imp, na.rm = TRUE)) %>%
  filter(any_imp) %>%
  pull(variable)

regions <- unique(df_plot$Region)

for (r in regions) {
  p <- df_plot %>%
    filter(
      Region == r,
      variable %in% vars_imputed
    ) %>%
    ggplot(aes(Date, value)) +
      geom_line(color = 'gray40', linewidth = 0.5, na.rm = TRUE) +
      geom_point(aes(color = imp), size = 1.5, shape = 21, stroke = 0.8, na.rm = TRUE) +
      scale_color_manual(
        values = c('FALSE' = 'black', 'TRUE' = 'red'),
        labels = c('Observed', 'Imputed'),
        name = ''
      ) +
    labs(title = r) +
    facet_wrap(~ variable, scales = 'free', ncol = 2)

  print(p)
}

Code
# clean dataframe
df_monthly_imputed <- df_monthly_imputed %>%
  select(-starts_with('imp_'))

Exploratory Analysis (Relationships)

Next we can explore the relationships between our variables, primarily the response and its possible predictors, to determine what should go in our models.

Since SEMs are comprised of multiple “sub models”, this step takes a while, since it’s important to do this for every response variable. For simplicity, since log_Chlorophyll is my main response, I’ll only examine it in these examples.

Transform/modify data (optional)

If you know you’re likely to transform/modify the data, you can do it here so you have those columns for further exploratory analysis. Here I:

  • add log transforms: helps stabilize variance, fits certain data patterns

  • add AR1 lag terms: this is so the previous time point can be used in regressions. I just do this for response variables (log_Chlorophyll and (for later) log_DissAmmonia).

Code
# log data
df_monthly_log <- df_monthly_imputed %>%
  mutate(
    across(
      DissAmmonia:Total_Inflow,
      ~ log(.x),
      .names = 'log_{.col}'
    )
  )

# add lags
df_monthly_log <- df_monthly_log %>%
  arrange(Region, Date) %>%
  group_by(Region) %>%
  mutate(
    lag_logChlorophyll = lag(log_Chlorophyll),
    lag_Chlorophyll = lag(Chlorophyll),
    lag_logDissAmmonia = lag(log_DissAmmonia)
  ) %>%
  ungroup() %>%
  relocate(c(Chlorophyll, log_Chlorophyll, Date),
         .after = last_col())

Correlations

First we examine if any of our data is highly correlated (or not); high correlation between variables is an issue for causal inference models.

Code
corrplot::corrplot(
  cor(df_monthly_log %>% select(c(Chlorophyll,DissAmmonia:Total_Inflow)), use = 'pairwise.complete.obs'),
  method = 'color',
  type = 'lower',
  diag = TRUE
)

Code
corrplot::corrplot(
  cor(df_monthly_log %>% select(c(Chlorophyll,DissAmmonia:Total_Inflow)), use = 'pairwise.complete.obs'),
  method = 'number',
  number.cex = 0.6,
  type = 'lower',
  diag = TRUE
)

DissolvedOxygen and Temperature are strongly negatively correlated; DO is also correlated with DissNitrateNitrite. Temp is also slightly correlated with DissAmmonia and, more strongly, DissNitrateNitrite. DON/DissNitrateNitrite and Turbidity/Total Inflow are moderately correlated, which makes sense.

The strong DissolvedOxygen/Temperature relationship likely warrants further investigation (ie. determining their relationship/causal structure with respect to other variables).

Bivariate Plots (Continuous)

Next we graph our explanatory variables against our response variable. I like to look at the raw points (to get a true sense of the data) and interpolated splines to understand the functional relationship.

I ended up using log_Chlorophyll as my response, but kept the Chlorophyll graphs for reference.

Code
df_plt <- df_monthly_log %>%
  pivot_longer(
    cols = c(DissAmmonia:Total_Inflow, lag_Chlorophyll),
    names_to = 'Analyte',
    values_to = 'Value'
  )

df_plt_log <- df_monthly_log %>%
  pivot_longer(
    cols = c(log_DissAmmonia:log_Total_Inflow, lag_logChlorophyll),
    names_to = 'Analyte',
    values_to = 'Value'
  )

Scatterplots

Code
ggplot(df_plt) +
  geom_point(aes(Value, Chlorophyll, color = Region), shape = 21, alpha = 0.5) +
  facet_wrap(~Analyte, scales = 'free_x')

Code
ggplot(df_plt_log) +
  geom_point(aes(Value, Chlorophyll, color = Region), shape = 21, alpha = 0.5) +
  facet_wrap(~Analyte, scales = 'free_x')

Code
ggplot(df_plt) +
  geom_point(aes(Value, log_Chlorophyll, color = Region), shape = 21, alpha = 0.5) +
  facet_wrap(~Analyte, scales = 'free_x')

Code
ggplot(df_plt_log) +
  geom_point(aes(Value, log_Chlorophyll, color = Region), shape = 21, alpha = 0.5) +
  facet_wrap(~Analyte, scales = 'free_x')

Splines (Global)

Code
ggplot(df_plt) +
  geom_smooth(aes(Value, Chlorophyll)) +
  facet_wrap(~Analyte, scales = 'free_x')

Code
ggplot(df_plt_log) +
  geom_smooth(aes(Value, Chlorophyll)) +
  facet_wrap(~Analyte, scales = 'free_x')

Code
ggplot(df_plt) +
  geom_smooth(aes(Value, log_Chlorophyll)) +
  facet_wrap(~Analyte, scales = 'free_x')

Code
ggplot(df_plt_log) +
  geom_smooth(aes(Value, log_Chlorophyll)) +
  facet_wrap(~Analyte, scales = 'free_x')

From these, I decided to keep log_DissAmmonia, log_Turbidity, and Temperature in my model. I also added log_DissOrthophosphate (since I think it’s considered an important variable) and the response’s lag term (lag_logChlorophyll) to account for autocorrelation. Note that Temperature appears to have a non-linear, almost cubic relationship with log_Chlorophyll.

I’m also adding log_Salinity, which is probably wrong, but it’s useful for the SEM examples.

Bivariate (Categorical)

It’s also worth seeing how chlorophyll correlates with categorical variables, in this case Region:

Code
ggplot(df_plt) +
  geom_boxplot(aes(Region, Chlorophyll, fill = Region))

Code
ggplot(df_plt) +
  geom_violin(aes(Region, Chlorophyll, fill = Region))

South Central has the most variance, followed by Suisun Marsh. The other regions seem similarly distributed. Overall, the means are similar.

Trivariate (Cat./Cont.)

You can also visualize potential interaction effects between categorical and continuous variables (or cat:cat, though cont:cont is much more difficult; it’s easier to choose those via background knowledge+model selection). If the relationship between the response and continuous predictor appears to change based on the categorical one, the interaction may be significant.

Code
ggplot(df_plt) +
  geom_smooth(aes(Value, Chlorophyll, color = Region)) +
  facet_wrap(~Analyte, scales = 'free_x')

Code
ggplot(df_plt_log) +
  geom_smooth(aes(Value, Chlorophyll, color = Region)) +
  facet_wrap(~Analyte, scales = 'free_x')

Code
ggplot(df_plt) +
  geom_smooth(aes(Value, log_Chlorophyll, color = Region)) +
  facet_wrap(~Analyte, scales = 'free_x')

Code
ggplot(df_plt_log) +
  geom_smooth(aes(Value, log_Chlorophyll, color = Region)) +
  facet_wrap(~Analyte, scales = 'free_x')

Here, no interaction effect particularly stands out. This makes sense, since Region is more of a grouping variable (which is why we’ll likely use it as a random effect).

Individual Models

Since SEMs are a series of related models (with shared covariance), it’s worth checking each sub-model individually. For traditional SEMs, only lm (fixed effects) models can be used, though piecewiseSEM can handle other model types. Here, I used linear mixed effects models.

For simplicity, I subset the data from 02-01-2017 to 12-31-2022 to escape issues with unbalanced regional data. I also added three new variables: three basis splines for Temperature (essentially \(T^1\), \(T^2\), and \(T^3\)). This is so I can correctly model Temperature’s likely non-linear dependence on Chlorophyll; with three terms, I’m assuming it’s cubic. This is called polynomial regression.

I also remove some extreme outliers I found in the model.

Code
df_monthly_subset <- df_monthly_log %>%
  filter(Year < 2023 & Date > '2017-02-01')

df_monthly_subset <- df_monthly_subset %>%
  slice(-c(5, 337))

df_monthly_subset <- df_monthly_subset %>%
  mutate(
    Temp_poly = poly(Temperature, 3),
    Temp1 = Temp_poly[,1],
    Temp2 = Temp_poly[,2],
    Temp3 = Temp_poly[,3]
    )

# clean dataframe
df_monthly_subset <- df_monthly_subset %>%
  select(-contains('_poly'))

df_monthly_subset <- df_monthly_subset %>%
  drop_na(log_Chlorophyll, log_Salinity, log_DissAmmonia, log_DissOrthophos, log_Turbidity, lag_logChlorophyll, lag_logDissAmmonia, log_pH, log_Temperature)

Note: Model Comparisons

Usually you help select an appropriate model via model comparison procedures such as anova(mod1, mod2) and looking at statistics such as AIC/BIC/log likelihood and/or doing likelihood ratio tests (as well as using background knowledge). I didn’t here, though.

LMM - Chla

Testing/Training

Usually not needed for causal inference, mostly just doing as a quick check that the model makes sense. Will re-run with full dataset.

First I’ll create a model based on training data and see how well it predicts my testing data. Temperature looked cubic to me, so I added cubic basis splines.

Code
# -- split testing/training --
sorted_dates <- sort(unique(df_monthly_subset$Date))

cut_idx <- floor(0.8 * length(sorted_dates))

train_dates <- sorted_dates[1:cut_idx]
test_dates  <- sorted_dates[(cut_idx + 1):length(sorted_dates)]

df_train <- df_monthly_subset %>% filter(Date %in% train_dates)
df_test  <- df_monthly_subset %>% filter(Date %in% test_dates)

# -- training model --
fit_chl <- lmer(
  log_Chlorophyll ~
    log_Salinity +
    log_DissAmmonia +
    log_DissOrthophos +
    log_Turbidity +
    lag_logChlorophyll +
    Temp1 +
    Temp2 +
    Temp3 +
    (1|Region),
  data = df_train,
  na.action = na.omit
)

summary(fit_chl)
Linear mixed model fit by REML ['lmerMod']
Formula: 
log_Chlorophyll ~ log_Salinity + log_DissAmmonia + log_DissOrthophos +  
    log_Turbidity + lag_logChlorophyll + Temp1 + Temp2 + Temp3 +  
    (1 | Region)
   Data: df_train

REML criterion at convergence: 262

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.6782 -0.6343 -0.0389  0.5485  3.0145 

Random effects:
 Groups   Name        Variance Std.Dev.
 Region   (Intercept) 0.02142  0.1464  
 Residual             0.13622  0.3691  
Number of obs: 278, groups:  Region, 5

Fixed effects:
                    Estimate Std. Error t value
(Intercept)        -0.648463   0.258606  -2.508
log_Salinity        0.003173   0.022728   0.140
log_DissAmmonia    -0.118220   0.046442  -2.546
log_DissOrthophos  -0.063448   0.071647  -0.886
log_Turbidity       0.248301   0.051955   4.779
lag_logChlorophyll  0.417455   0.051700   8.075
Temp1               1.159657   0.583910   1.986
Temp2               1.106531   0.448444   2.467
Temp3               2.270153   0.462919   4.904

Correlation of Fixed Effects:
            (Intr) lg_Sln lg_DsA lg_DsO lg_Trb lg_lgC Temp1  Temp2 
log_Salinty -0.298                                                 
log_DssAmmn  0.354 -0.152                                          
lg_DssOrthp  0.635 -0.159 -0.079                                   
log_Turbdty -0.580  0.249  0.062 -0.057                            
lg_lgChlrph  0.034 -0.043  0.120  0.123 -0.158                     
Temp1       -0.032 -0.009  0.510 -0.274  0.257 -0.341              
Temp2       -0.034 -0.066 -0.037 -0.064 -0.096  0.168 -0.082       
Temp3       -0.079  0.145 -0.150  0.037  0.120 -0.180  0.053 -0.013
Code
# -- predicted model --
df_test <- df_test %>%
  drop_na(log_Chlorophyll, log_Temperature, log_DissAmmonia, log_Total_Inflow)

df_test <- df_test %>%
  mutate(pred_log_Chl = predict(fit_chl, newdata = df_test))

# -- plot --
rng <- range(c(df_test$pred_log_Chl, df_test$log_Chlorophyll), na.rm = TRUE)

ggplot(df_test, aes(x = pred_log_Chl, y = log_Chlorophyll)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = 'lm', se = FALSE, linewidth = 0.7) +
  geom_abline(intercept = 0, slope = 1, linetype = 'dashed') +
  coord_equal(xlim = rng, ylim = rng) +
  labs(
    x = 'Predicted',
    y = 'Observed',
    title = 'Predicted vs Observed'
  ) +
  theme_minimal()

Looks pretty good! It does overestimate, but it’s a consistent offset and not horribly off.

Next is model diagnostics:

Code
RMSE <- sqrt(mean((df_test$log_Chlorophyll - df_test$pred_log_Chl)^2))
MAE  <- mean(abs(df_test$log_Chlorophyll - df_test$pred_log_Chl))
bias <- mean(df_test$pred_log_Chl - df_test$log_Chlorophyll)

print(RMSE); print(MAE); print(bias)
[1] 0.4276552
[1] 0.341872
[1] 0.1401536
Code
varcomp <- as.data.frame(VarCorr(fit_chl))
var_re  <- varcomp$vcov[varcomp$grp == 'Region']
var_res <- varcomp$vcov[varcomp$grp == 'Residual']

ICC <- var_re / (var_re + var_res)

print(ICC)
[1] 0.135882
Code
plot(fit_chl)

Code
res <- residuals(fit_chl)

df_res <- df_train
df_res$resid <- res

regions <- unique(df_res$Region)
n <- length(regions)

for (r in regions) {
  tmp <- df_res$resid[df_res$Region == r]
  acf(tmp,
      main = paste(r),
      na.action = na.pass)
}

Code
infl <- check_outliers(fit_chl)
infl
OK: No outliers detected.
- Based on the following method and threshold: cook (0.929).
- For variable: (Whole model)
Code
res <- residuals(fit_chl)

qqnorm(res)
qqline(res, col = 'red', lwd = 2)

Model diagnostics look pretty good; a few minor lag spikes, and there’s some right-tail skew, but neither are particularly noteworthy. A few acf graphs do show some seasonality but it’s not significant.

Full model

Now we run the full model:

Code
fit_chl <- lmer(
  log_Chlorophyll ~
    log_Salinity +
    log_DissAmmonia +
    log_DissOrthophos +
    log_Turbidity +
    lag_logChlorophyll +
    Temp1 +
    Temp2 +
    Temp3 +
    (1|Region),
  data = df_monthly_subset,
  na.action = na.omit
)

summary(fit_chl)
Linear mixed model fit by REML ['lmerMod']
Formula: 
log_Chlorophyll ~ log_Salinity + log_DissAmmonia + log_DissOrthophos +  
    log_Turbidity + lag_logChlorophyll + Temp1 + Temp2 + Temp3 +  
    (1 | Region)
   Data: df_monthly_subset

REML criterion at convergence: 340

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.58303 -0.63664 -0.06386  0.54888  2.81484 

Random effects:
 Groups   Name        Variance Std.Dev.
 Region   (Intercept) 0.02514  0.1586  
 Residual             0.14297  0.3781  
Number of obs: 347, groups:  Region, 5

Fixed effects:
                    Estimate Std. Error t value
(Intercept)        -0.614994   0.245478  -2.505
log_Salinity       -0.002593   0.021635  -0.120
log_DissAmmonia    -0.088624   0.038233  -2.318
log_DissOrthophos  -0.037385   0.067214  -0.556
log_Turbidity       0.281013   0.049134   5.719
lag_logChlorophyll  0.399493   0.046634   8.567
Temp1               1.448116   0.517629   2.798
Temp2               0.566255   0.400806   1.413
Temp3               2.353127   0.402061   5.853

Correlation of Fixed Effects:
            (Intr) lg_Sln lg_DsA lg_DsO lg_Trb lg_lgC Temp1  Temp2 
log_Salinty -0.250                                                 
log_DssAmmn  0.310 -0.098                                          
lg_DssOrthp  0.631 -0.130 -0.150                                   
log_Turbdty -0.604  0.232  0.019 -0.080                            
lg_lgChlrph -0.004 -0.021 -0.001  0.074 -0.217                     
Temp1       -0.069  0.001  0.492 -0.290  0.263 -0.401              
Temp2       -0.023 -0.066  0.001 -0.050 -0.081  0.206 -0.063       
Temp3       -0.064  0.112 -0.143  0.057  0.119 -0.159  0.008  0.001

The coefficients are similar, which is expected.

Model diagnostics:

Code
varcomp <- as.data.frame(VarCorr(fit_chl))
var_re  <- varcomp$vcov[varcomp$grp == 'Region']
var_res <- varcomp$vcov[varcomp$grp == 'Residual']

ICC <- var_re/(var_re + var_res)

print(ICC)
[1] 0.1495563
Code
plot(fit_chl)

Code
res <- residuals(fit_chl)

df_res <- df_monthly_subset
df_res$resid <- res

regions <- unique(df_res$Region)
n <- length(regions)

for (r in regions) {
  tmp <- df_res$resid[df_res$Region == r]
  acf(tmp,
      main = paste(r),
      na.action = na.pass)
}

Code
infl <- check_outliers(fit_chl)
infl
OK: No outliers detected.
- Based on the following method and threshold: cook (0.929).
- For variable: (Whole model)
Code
vif(fit_chl)
      log_Salinity    log_DissAmmonia  log_DissOrthophos      log_Turbidity 
          1.104629           1.450255           1.127412           1.172911 
lag_logChlorophyll              Temp1              Temp2              Temp3 
          1.355640           1.865876           1.058316           1.070728 
Code
mf <- model.frame(fit_chl)
row_map <- as.numeric(rownames(mf))
res <- residuals(fit_chl)

df_resrow <- tibble(
  row = row_map,
  resid = res,
  fitted = fitted(fit_chl)
)

df_res <- df_monthly_subset %>%
  mutate(row = row_number()) %>%
  left_join(df_resrow, by = 'row')
Code
res <- residuals(fit_chl)

qqnorm(res)
qqline(res, col = 'red', lwd = 2)

Looks similar to the test model, so we’re good.

LMM - DissAmmonia

Since SEMs ideally have at least two “sub models”, I’ll throw together a quick model for DissolvedAmmonia. I didn’t bother with intense model diagnostics here, but they should definitely be done for real models.

Code
df_plt_log <- df_monthly_log %>%
  mutate(lag_logDissAmmonia = lag(log(DissAmmonia))) %>%
  pivot_longer(
    cols = c(log_DissNitrateNitrite:log_Total_Inflow, lag_logChlorophyll, lag_logDissAmmonia),
    names_to = 'Analyte',
    values_to = 'Value'
  )
Code
ggplot(df_plt) +
  geom_smooth(aes(Value, log_DissAmmonia)) +
  facet_wrap(~Analyte, scales = 'free_x') +
  theme_bw()

Code
ggplot(df_plt_log) +
  geom_smooth(aes(Value, log_DissAmmonia)) +
  facet_wrap(~Analyte, scales = 'free_x') +
  theme_bw()

Code
fit_amm <- lmer(
  log_DissAmmonia ~
    lag_logDissAmmonia +
    Temp1 +
    log_Turbidity +
    (1|Region),
  data = df_monthly_subset,
  na.action = na.omit
)

summary(fit_amm)
Linear mixed model fit by REML ['lmerMod']
Formula: log_DissAmmonia ~ lag_logDissAmmonia + Temp1 + log_Turbidity +  
    (1 | Region)
   Data: df_monthly_subset

REML criterion at convergence: 489.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.6814 -0.6246  0.0659  0.6315  3.3702 

Random effects:
 Groups   Name        Variance Std.Dev.
 Region   (Intercept) 0.06061  0.2462  
 Residual             0.22566  0.4750  
Number of obs: 347, groups:  Region, 5

Fixed effects:
                   Estimate Std. Error t value
(Intercept)        -0.96356    0.25568  -3.769
lag_logDissAmmonia  0.49271    0.04699  10.484
Temp1              -3.50570    0.55817  -6.281
log_Turbidity      -0.13655    0.05981  -2.283

Correlation of Fixed Effects:
            (Intr) lg_lDA Temp1 
lg_lgDssAmm  0.560              
Temp1        0.141  0.487       
log_Turbdty -0.747 -0.088  0.144
Code
plot(fit_amm)

SEMs

Linear Mixed Effects Model

Piecewise

Now we can fit our models into an SEM. Since they’re both LMMs, we have to use piecewiseSEM.

Code
fit_chl <- lmer(
  log_Chlorophyll ~
    log_Salinity +
    log_DissAmmonia +
    log_DissOrthophos +
    log_Turbidity +
    lag_logChlorophyll +
    Temp1 +
    Temp2 +
    Temp3 +
    (1|Region),
  data = df_monthly_subset)

fit_amm <- lmer(
  log_DissAmmonia ~
    lag_logDissAmmonia +
    Temp1 +
    log_Turbidity +
    (1|Region),
  data = df_monthly_subset)


psem_out <- psem(fit_chl, fit_amm, data = df_monthly_subset)
summary(psem_out, standardize = 'scale')

  |                                                                            
  |                                                                      |   0%

  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |======================================================================| 100%

Structural Equation Model of psem_out 

Call:
  log_Chlorophyll ~ log_Salinity + log_DissAmmonia + log_DissOrthophos + log_Turbidity + lag_logChlorophyll + Temp1 + Temp2 + Temp3
  log_DissAmmonia ~ lag_logDissAmmonia + Temp1 + log_Turbidity

    AIC
 863.279

---
Tests of directed separation:

                              Independ.Claim Test.Type       DF Crit.Value
        log_DissAmmonia ~ log_Salinity + ...      coef  38.7998     6.7305
   log_DissAmmonia ~ log_DissOrthophos + ...      coef 339.1377     2.4154
  log_DissAmmonia ~ lag_logChlorophyll + ...      coef 341.9718     0.8992
               log_DissAmmonia ~ Temp2 + ...      coef 341.1981     0.7206
               log_DissAmmonia ~ Temp3 + ...      coef 338.9824     5.8138
  log_Chlorophyll ~ lag_logDissAmmonia + ...      coef 336.4220     5.0180
  P.Value  
   0.0133 *
   0.1211  
   0.3437  
   0.3966  
   0.0164 *
   0.0257 *

--
Global goodness-of-fit:

Chi-Squared = 6.016 with P-value = 0.421 and on 6 degrees of freedom
Fisher's C = 32.385 with P-value = 0.001 and on 12 degrees of freedom

---
Coefficients:

         Response          Predictor Estimate Std.Error       DF Crit.Value
  log_Chlorophyll       log_Salinity  -0.0026    0.0216  55.3931     0.0121
  log_Chlorophyll    log_DissAmmonia  -0.0886    0.0382 336.6293     5.2865
  log_Chlorophyll  log_DissOrthophos  -0.0374    0.0672 306.3545     0.2959
  log_Chlorophyll      log_Turbidity   0.2810    0.0491  69.9676    28.1085
  log_Chlorophyll lag_logChlorophyll   0.3995    0.0466 334.4428    71.6796
  log_Chlorophyll              Temp1   1.4481    0.5176 332.8132     7.6704
  log_Chlorophyll              Temp2   0.5663    0.4008 337.1051     1.9700
  log_Chlorophyll              Temp3   2.3531    0.4021 335.1774    34.1892
  log_DissAmmonia lag_logDissAmmonia   0.4927    0.0470 342.8033   108.9828
  log_DissAmmonia              Temp1  -3.5057    0.5582 342.6721    39.1401
  log_DissAmmonia      log_Turbidity  -0.1365    0.0598 232.7812     4.8258
  P.Value Std.Estimate    
   0.9129      -0.0080    
   0.0221      -0.1149   *
   0.5868      -0.0260    
   0.0000       0.3658 ***
   0.0000       0.4042 ***
   0.0059       0.1433  **
   0.1614       0.0560    
   0.0000       0.2328 ***
   0.0000       0.4879 ***
   0.0000      -0.2676 ***
   0.0290      -0.1371   *

  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05

---
Individual R-squared:

         Response method Marginal Conditional
  log_Chlorophyll   none     0.43        0.51
  log_DissAmmonia   none     0.40        0.53

There seems to be a bug, since it says it can’t test things that show up in the subsequent dataframes. Nevertheless, we note:

  • AIC: this is only meaningful compared to other models (with the same RE structure)
    • (relatively) smaller the better
  • Chi-squared (Goodness of Fit): not significant, which means our model fits the observed data well
  • Fisher’s C: significant, there’s some missing paths in our causal structure
    • we can look at the independence claims (first df) to see which (eg. log_DissAmmonia and log_Salinity)
  • the second dataframe gives the coefficients (standardized)
  • the third dataframe gives the \(R^2\) values
    • marginal are without the random effects while the conditional ones are

Also note that the coefficients can be interpreted (1 SD change in x corresponds to \(\beta\) SD change in y, conditional on other variables) except for the three Temperature ones. Together, they define a nonlinear chl–temperature curve, which is not constant (but can be graphed).

Linear Fixed Effects Model

Piecewise

For comparison, we can run the same model without the random effect; this makes it comparable to traditional SEM.

Code
fit_chl <- lm(
  log_Chlorophyll ~
    log_Salinity +
    log_DissAmmonia +
    log_DissOrthophos +
    log_Turbidity +
    lag_logChlorophyll +
    Temp1 +
    Temp2 +
    Temp3,
  data = df_monthly_subset)

fit_amm <- lm(
  log_DissAmmonia ~
    lag_logDissAmmonia +
    Temp1 +
    log_DissOrthophos +
    log_Turbidity,
    data = df_monthly_subset)

psem_out <- psem(fit_chl, fit_amm, data = df_monthly_subset)
summary(psem_out, standardize = 'scale')

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |======================================================================| 100%

Structural Equation Model of psem_out 

Call:
  log_Chlorophyll ~ log_Salinity + log_DissAmmonia + log_DissOrthophos + log_Turbidity + lag_logChlorophyll + Temp1 + Temp2 + Temp3
  log_DissAmmonia ~ lag_logDissAmmonia + Temp1 + log_DissOrthophos + log_Turbidity

    AIC
 848.324

---
Tests of directed separation:

                              Independ.Claim Test.Type  DF Crit.Value P.Value
        log_DissAmmonia ~ log_Salinity + ...      coef 341     3.8375  0.0001
  log_DissAmmonia ~ lag_logChlorophyll + ...      coef 341     0.4192  0.6753
               log_DissAmmonia ~ Temp2 + ...      coef 341    -1.8023  0.0724
               log_DissAmmonia ~ Temp3 + ...      coef 341     1.9241  0.0552
  log_Chlorophyll ~ lag_logDissAmmonia + ...      coef 337     1.8446  0.0660
     
  ***
     
     
     
     

--
Global goodness-of-fit:

Chi-Squared = 25.971 with P-value = 0 and on 5 degrees of freedom
Fisher's C = 34.904 with P-value = 0 and on 10 degrees of freedom

---
Coefficients:

         Response          Predictor Estimate Std.Error  DF Crit.Value P.Value
  log_Chlorophyll       log_Salinity  -0.0083    0.0145 338    -0.5715  0.5680
  log_Chlorophyll    log_DissAmmonia  -0.1177    0.0382 338    -3.0803  0.0022
  log_Chlorophyll  log_DissOrthophos   0.0260    0.0604 338     0.4306  0.6670
  log_Chlorophyll      log_Turbidity   0.2212    0.0363 338     6.0935  0.0000
  log_Chlorophyll lag_logChlorophyll   0.5050    0.0442 338    11.4285  0.0000
  log_Chlorophyll              Temp1   0.7501    0.5149 338     1.4566  0.1462
  log_Chlorophyll              Temp2   0.9254    0.4076 338     2.2702  0.0238
  log_Chlorophyll              Temp3   2.4551    0.4164 338     5.8967  0.0000
  log_DissAmmonia lag_logDissAmmonia   0.5352    0.0478 342    11.1842  0.0000
  log_DissAmmonia              Temp1  -3.1621    0.5874 342    -5.3832  0.0000
  log_DissAmmonia  log_DissOrthophos   0.1461    0.0740 342     1.9753  0.0490
  log_DissAmmonia      log_Turbidity   0.0796    0.0411 342     1.9369  0.0536
  Std.Estimate    
       -0.0256    
       -0.1525  **
        0.0180    
        0.2879 ***
        0.5110 ***
        0.0742    
        0.0916   *
        0.2429 ***
        0.5300 ***
       -0.2414 ***
        0.0783   *
        0.0800    

  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05

---
Individual R-squared:

         Response method R.squared
  log_Chlorophyll   none      0.49
  log_DissAmmonia   none      0.52

Traditional

Code
model <- '
    log_Chlorophyll ~ log_Salinity +
    log_DissAmmonia +
    log_DissOrthophos +
    log_Turbidity +
    lag_logChlorophyll +
    Temp1 +
    Temp2 +
    Temp3

    log_DissAmmonia ~ lag_logDissAmmonia +
    Temp1 +
    log_DissOrthophos +
    log_Turbidity
'

fit <- sem(model, data = df_monthly_subset)

summary(fit, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6-20 ended normally after 2 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        14

  Number of observations                           347

Model Test User Model:
                                                      
  Test statistic                                25.971
  Degrees of freedom                                 5
  P-value (Chi-square)                           0.000

Model Test Baseline Model:

  Test statistic                               510.446
  Degrees of freedom                                17
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.958
  Tucker-Lewis Index (TLI)                       0.856

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)               -408.162
  Loglikelihood unrestricted model (H1)       -395.177
                                                      
  Akaike (AIC)                                 844.324
  Bayesian (BIC)                               898.214
  Sample-size adjusted Bayesian (SABIC)        853.802

Root Mean Square Error of Approximation:

  RMSEA                                          0.110
  90 Percent confidence interval - lower         0.071
  90 Percent confidence interval - upper         0.153
  P-value H_0: RMSEA <= 0.050                    0.008
  P-value H_0: RMSEA >= 0.080                    0.899

Standardized Root Mean Square Residual:

  SRMR                                           0.023

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  log_Chlorophyll ~                                                      
    log_Salinity      -0.008    0.014   -0.591    0.554   -0.008   -0.026
    log_DissAmmoni    -0.118    0.037   -3.205    0.001   -0.118   -0.152
    log_DssOrthphs     0.026    0.059    0.440    0.660    0.026    0.018
    log_Turbidity      0.221    0.036    6.092    0.000    0.221    0.287
    lg_lgChlrphyll     0.505    0.044   11.585    0.000    0.505    0.510
    Temp1              0.750    0.503    1.491    0.136    0.750    0.074
    Temp2              0.925    0.402    2.302    0.021    0.925    0.091
    Temp3              2.455    0.408    6.019    0.000    2.455    0.242
  log_DissAmmonia ~                                                      
    lag_logDssAmmn     0.535    0.048   11.266    0.000    0.535    0.530
    Temp1             -3.162    0.583   -5.422    0.000   -3.162   -0.241
    log_DssOrthphs     0.146    0.073    1.990    0.047    0.146    0.078
    log_Turbidity      0.080    0.041    1.951    0.051    0.080    0.080

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .log_Chlorphyll    0.151    0.011   13.172    0.000    0.151    0.512
   .log_DissAmmoni    0.238    0.018   13.172    0.000    0.238    0.481
Code
modindices(fit, sort = TRUE)
lhs op rhs mi epc sepc.lv sepc.all sepc.nox
54 log_DissAmmonia ~ log_Salinity 14.3653045 0.0661755 0.0661755 0.1576313 0.0941093
68 log_DissOrthophos ~ log_DissAmmonia 4.5340375 0.2662180 0.2662180 0.4969079 0.4969079
57 log_DissAmmonia ~ Temp3 3.7267608 0.9639944 0.9639944 0.0735945 1.3709130
59 log_Salinity ~ log_DissAmmonia 3.4500263 0.1463177 0.1463177 0.0614259 0.0614259
51 log_Chlorophyll ~~ log_DissAmmonia 3.4289810 -0.0366593 -0.0366593 -0.1931149 -0.1931149
52 log_Chlorophyll ~ lag_logDissAmmonia 3.4289810 0.0824490 0.0824490 0.1055732 0.1516052
56 log_DissAmmonia ~ Temp2 3.2743916 -0.8991260 -0.8991260 -0.0686422 -1.2786625
58 log_Salinity ~ log_Chlorophyll 2.5503856 -0.8101295 -0.8101295 -0.2630362 -0.2630362
67 log_DissOrthophos ~ log_Chlorophyll 1.8942228 -0.2135538 -0.2135538 -0.3082850 -0.3082850
94 Temp1 ~ log_Chlorophyll 1.5926568 0.0129093 0.0129093 0.1307791 0.1307791
77 log_Turbidity ~ log_DissAmmonia 1.5157448 -0.0712911 -0.0712911 -0.0709825 -0.0709825
113 Temp3 ~ log_DissAmmonia 1.1365886 0.0028511 0.0028511 0.0373452 0.0373452
112 Temp3 ~ log_Chlorophyll 0.8568775 -0.0157100 -0.0157100 -0.1591520 -0.1591520
121 lag_logDissAmmonia ~ log_Chlorophyll 0.6779539 0.0318754 0.0318754 0.0248935 0.0248935
104 Temp2 ~ log_DissAmmonia 0.4806518 -0.0019021 -0.0019021 -0.0249153 -0.0249153
76 log_Turbidity ~ log_Chlorophyll 0.3055162 -0.0714940 -0.0714940 -0.0550544 -0.0550544
122 lag_logDissAmmonia ~ log_DissAmmonia 0.2053313 -0.1007060 -0.1007060 -0.1016905 -0.1016905
85 lag_logChlorophyll ~ log_Chlorophyll 0.1814990 0.0704217 0.0704217 0.0697551 0.0697551
53 log_DissAmmonia ~ log_Chlorophyll 0.1789647 -0.0307024 -0.0307024 -0.0237453 -0.0237453
55 log_DissAmmonia ~ lag_logChlorophyll 0.1787199 0.0224721 0.0224721 0.0175461 0.0319580
95 Temp1 ~ log_DissAmmonia 0.1355445 -0.0023122 -0.0023122 -0.0302873 -0.0302873
103 Temp2 ~ log_Chlorophyll 0.1099250 0.0057720 0.0057720 0.0584736 0.0584736
86 lag_logChlorophyll ~ log_DissAmmonia 0.0058513 -0.0018456 -0.0018456 -0.0023637 -0.0023637
20 log_Salinity ~~ Temp2 0.0000000 0.0000000 0.0000000 NA 0.0000000
45 Temp2 ~~ Temp2 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
21 log_Salinity ~~ Temp3 0.0000000 0.0000000 0.0000000 NA 0.0000000
48 Temp3 ~~ Temp3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
39 lag_logChlorophyll ~~ Temp3 0.0000000 0.0000000 0.0000000 NA 0.0000000
78 log_Turbidity ~ log_Salinity 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
119 Temp3 ~ Temp2 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
32 log_Turbidity ~~ Temp1 0.0000000 0.0000000 0.0000000 NA 0.0000000
69 log_DissOrthophos ~ log_Salinity 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
23 log_DissOrthophos ~~ log_DissOrthophos 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
30 log_Turbidity ~~ log_Turbidity 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
26 log_DissOrthophos ~~ Temp1 0.0000000 0.0000000 0.0000000 NA 0.0000000
96 Temp1 ~ log_Salinity 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
16 log_Salinity ~~ log_DissOrthophos 0.0000000 0.0000000 0.0000000 NA 0.0000000
29 log_DissOrthophos ~~ lag_logDissAmmonia 0.0000000 0.0000000 0.0000000 NA 0.0000000
22 log_Salinity ~~ lag_logDissAmmonia 0.0000000 0.0000000 0.0000000 NA 0.0000000
38 lag_logChlorophyll ~~ Temp2 0.0000000 0.0000000 0.0000000 NA 0.0000000
125 lag_logDissAmmonia ~ log_Turbidity 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
24 log_DissOrthophos ~~ log_Turbidity 0.0000000 0.0000000 0.0000000 NA 0.0000000
50 lag_logDissAmmonia ~~ lag_logDissAmmonia 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
65 log_Salinity ~ Temp3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
19 log_Salinity ~~ Temp1 0.0000000 0.0000000 0.0000000 NA 0.0000000
79 log_Turbidity ~ log_DissOrthophos 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
28 log_DissOrthophos ~~ Temp3 0.0000000 0.0000000 0.0000000 NA 0.0000000
73 log_DissOrthophos ~ Temp2 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
123 lag_logDissAmmonia ~ log_Salinity 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
18 log_Salinity ~~ lag_logChlorophyll 0.0000000 0.0000000 0.0000000 NA 0.0000000
17 log_Salinity ~~ log_Turbidity 0.0000000 0.0000000 0.0000000 NA 0.0000000
82 log_Turbidity ~ Temp2 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
81 log_Turbidity ~ Temp1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
33 log_Turbidity ~~ Temp2 0.0000000 0.0000000 0.0000000 NA 0.0000000
27 log_DissOrthophos ~~ Temp2 0.0000000 0.0000000 0.0000000 NA 0.0000000
66 log_Salinity ~ lag_logDissAmmonia 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
43 Temp1 ~~ Temp3 0.0000000 0.0000000 0.0000000 NA 0.0000000
46 Temp2 ~~ Temp3 0.0000000 0.0000000 0.0000000 NA 0.0000000
129 lag_logDissAmmonia ~ Temp3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
98 Temp1 ~ log_Turbidity 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
75 log_DissOrthophos ~ lag_logDissAmmonia 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
105 Temp2 ~ log_Salinity 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
74 log_DissOrthophos ~ Temp3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
42 Temp1 ~~ Temp2 0.0000000 0.0000000 0.0000000 NA 0.0000000
101 Temp1 ~ Temp3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
62 log_Salinity ~ lag_logChlorophyll 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
49 Temp3 ~~ lag_logDissAmmonia 0.0000000 0.0000000 0.0000000 NA 0.0000000
97 Temp1 ~ log_DissOrthophos 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
83 log_Turbidity ~ Temp3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
106 Temp2 ~ log_DissOrthophos 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
41 Temp1 ~~ Temp1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
61 log_Salinity ~ log_Turbidity 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
88 lag_logChlorophyll ~ log_DissOrthophos 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
71 log_DissOrthophos ~ lag_logChlorophyll 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
70 log_DissOrthophos ~ log_Turbidity 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
34 log_Turbidity ~~ Temp3 0.0000000 0.0000000 0.0000000 NA 0.0000000
114 Temp3 ~ log_Salinity 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
118 Temp3 ~ Temp1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
40 lag_logChlorophyll ~~ lag_logDissAmmonia 0.0000000 0.0000000 0.0000000 NA 0.0000000
124 lag_logDissAmmonia ~ log_DissOrthophos 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
115 Temp3 ~ log_DissOrthophos 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
111 Temp2 ~ lag_logDissAmmonia 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
80 log_Turbidity ~ lag_logChlorophyll 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
47 Temp2 ~~ lag_logDissAmmonia 0.0000000 0.0000000 0.0000000 NA 0.0000000
100 Temp1 ~ Temp2 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
44 Temp1 ~~ lag_logDissAmmonia 0.0000000 0.0000000 0.0000000 NA 0.0000000
72 log_DissOrthophos ~ Temp1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
63 log_Salinity ~ Temp1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
25 log_DissOrthophos ~~ lag_logChlorophyll 0.0000000 0.0000000 0.0000000 NA 0.0000000
92 lag_logChlorophyll ~ Temp3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
89 lag_logChlorophyll ~ log_Turbidity 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
36 lag_logChlorophyll ~~ lag_logChlorophyll 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
84 log_Turbidity ~ lag_logDissAmmonia 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
107 Temp2 ~ log_Turbidity 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
93 lag_logChlorophyll ~ lag_logDissAmmonia 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
108 Temp2 ~ lag_logChlorophyll 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
102 Temp1 ~ lag_logDissAmmonia 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
99 Temp1 ~ lag_logChlorophyll 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
60 log_Salinity ~ log_DissOrthophos 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
120 Temp3 ~ lag_logDissAmmonia 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
37 lag_logChlorophyll ~~ Temp1 0.0000000 0.0000000 0.0000000 NA 0.0000000
109 Temp2 ~ Temp1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
117 Temp3 ~ lag_logChlorophyll 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
127 lag_logDissAmmonia ~ Temp1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
116 Temp3 ~ log_Turbidity 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
87 lag_logChlorophyll ~ log_Salinity 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
110 Temp2 ~ Temp3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
126 lag_logDissAmmonia ~ lag_logChlorophyll 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
90 lag_logChlorophyll ~ Temp1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
31 log_Turbidity ~~ lag_logChlorophyll 0.0000000 0.0000000 0.0000000 NA 0.0000000
35 log_Turbidity ~~ lag_logDissAmmonia 0.0000000 0.0000000 0.0000000 NA 0.0000000
128 lag_logDissAmmonia ~ Temp2 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000

We note that the coefficients are very similar; the main difference is in interpretation. Here, the high RMSEA, especially, points to missing structural paths (as does the somewhat low TLI). By looking at the modification indices, we see that (similar to the d-separation tests from psem) a log_DissAmmonia ~ log_Salinity link is likely missing.

Note that traditional SEM can also evaluate residual covariance between response variables, while psem cannot (the log_Chlorophyll ~~ log_DissAmmonia row in the MI table).

Also note that, if we include latent variables, we must use traditional SEM. See the stats examples SEM file for that code.

Traditional: Plot

We can also plot traditional SEMs, just for fun:

Code
semPaths(
  fit,
  what = 'std',
  whatLabels = 'std',
  layout = 'tree2',
  nCharNodes = 5,
  edge.label.cex = 1.1,
  sizeMan = 5,
  edge.width = 1,
  esize = 1,
  residuals = FALSE,
  mar = c(7, 7, 7, 7)
)

General Additive Model (GAM)

Piecewise SEMs can also handle GAMs, which are another type of linear model. The advantage is how they handle non-linear relationships.

In the previous models, we identified the relationship between Chlorophyll and Temperature as non-linear, so we added polynomial basis functions to our formulas. GAMs essentially do the exact same thing but use splines instead, which are more sophisticated.

(I won’t go into model diagnostics for GAMs here.)

LMM-like GAM

We can essentially re-create the LMM model by switching out the Temperature terms for a smooth function and keeping the others as linear predictors:

Code
# default basis spline = thin plate
gam_chl <- gam(
  log_Chlorophyll ~
    log_Salinity +
    log_DissAmmonia +
    log_DissOrthophos +
    log_Turbidity +
    lag_logChlorophyll +
    s(Temperature, k = 6) +
    s(Region, bs = 're'), # random effect (ridge penalty)
  data = df_monthly_subset,
  method = 'REML'
)

gam_amm <- gam(
  log_DissAmmonia ~
    lag_logDissAmmonia +
    log_Turbidity +
    s(Temperature, k = 6) +
    s(Region, bs = 're'),
  data = df_monthly_subset,
  method = 'REML'
)

psem_gam <- psem(gam_chl, gam_amm, data = df_monthly_subset)
summary(psem_gam, standardize = 'scale')

  |                                                                            
  |                                                                      |   0%

  |                                                                            
  |==================                                                    |  25%

  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================================                  |  75%

  |                                                                            
  |======================================================================| 100%

Structural Equation Model of psem_gam 

Call:
  log_Chlorophyll ~ log_Salinity + log_DissAmmonia + log_DissOrthophos + log_Turbidity + lag_logChlorophyll + s(Temperature, k = 6) + s(Region, bs = "re")
  log_DissAmmonia ~ lag_logDissAmmonia + log_Turbidity + s(Temperature, k = 6) + s(Region, bs = "re")

    AIC
 802.565

---
Tests of directed separation:

                              Independ.Claim Test.Type  DF Crit.Value P.Value   
        log_DissAmmonia ~ log_Salinity + ...      coef 347     3.1598  0.0017 **
   log_DissAmmonia ~ log_DissOrthophos + ...      coef 347     1.7240  0.0856   
  log_DissAmmonia ~ lag_logChlorophyll + ...      coef 347     0.9532  0.3412   
  log_Chlorophyll ~ lag_logDissAmmonia + ...      coef 347     2.1399  0.0331  *

--
Global goodness-of-fit:

Chi-Squared = 13.168 with P-value = 0.013 and on 4.242 degrees of freedom
Fisher's C = 26.612 with P-value = 0.001 and on 8 degrees of freedom

---
Coefficients:

         Response          Predictor Estimate Std.Error       DF Crit.Value
  log_Chlorophyll       log_Salinity  -0.0053    0.0218 347.0000    -0.2448
  log_Chlorophyll    log_DissAmmonia  -0.0874    0.0381 347.0000    -2.2951
  log_Chlorophyll  log_DissOrthophos  -0.0413     0.067 347.0000    -0.6165
  log_Chlorophyll      log_Turbidity   0.2783    0.0491 347.0000     5.6733
  log_Chlorophyll lag_logChlorophyll    0.404    0.0467 347.0000     8.6581
  log_Chlorophyll     s(Temperature)        -         -   4.8849     8.5360
  log_Chlorophyll          s(Region)        -         -   4.0000     7.4867
  log_DissAmmonia lag_logDissAmmonia   0.4968    0.0471 347.0000    10.5459
  log_DissAmmonia      log_Turbidity  -0.1321    0.0596 347.0000    -2.2159
  log_DissAmmonia     s(Temperature)        -         -   2.9887    14.2789
  log_DissAmmonia          s(Region)        -         -   4.0000     6.6010
  P.Value Std.Estimate    
   0.8068            -    
   0.0223            -   *
   0.5380            -    
   0.0000            - ***
   0.0000            - ***
   0.0000            - ***
   0.0000            - ***
   0.0000            - ***
   0.0274            -   *
   0.0000            - ***
   0.0000            - ***

  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05

---
Individual R-squared:

         Response method R.squared
  log_Chlorophyll   none      0.52
  log_DissAmmonia   none      0.55

Our results are very similar, which makes sense:

  • \(R^2\) values are almost the same (though the GAM is slightly better)

  • linear coefficient and p-values are almost the same

However, the big difference is that Temperature is now included as a smooth function instead of multiple polynomial terms. This changes the shape of its partial effect curve:

Code
df_temp_plt <- df_monthly_subset %>%
  mutate(
    poly_temp = poly(Temperature, 3)
  )

fit_chl <- lmer(
  log_Chlorophyll ~
    log_Salinity +
    log_DissAmmonia +
    log_DissOrthophos +
    log_Turbidity +
    lag_logChlorophyll +
    Temp1 +
    Temp2 +
    Temp3 +
    (1|Region),
  data = df_temp_plt
)

poly_obj <- attr(df_temp_plt$poly_temp, 'coefs')

temp_seq <- seq(
  min(df_monthly_subset$Temperature, na.rm = TRUE),
  max(df_monthly_subset$Temperature, na.rm = TRUE),
  length.out = 300
)

new_poly <- poly(temp_seq, degree = 3, coefs = poly_obj)

new_dat <- data.frame(
  log_Salinity = mean(df_monthly_subset$log_Salinity, na.rm = TRUE),
  log_DissAmmonia = mean(df_monthly_subset$log_DissAmmonia, na.rm = TRUE),
  log_DissOrthophos = mean(df_monthly_subset$log_DissOrthophos, na.rm = TRUE),
  log_Turbidity = mean(df_monthly_subset$log_Turbidity, na.rm = TRUE),
  lag_logChlorophyll = mean(df_monthly_subset$lag_logChlorophyll, na.rm = TRUE),
  Temp1 = new_poly[,1],
  Temp2 = new_poly[,2],
  Temp3 = new_poly[,3]
)

new_dat$partial <- predict(fit_chl, newdata = new_dat, re.form = NA)

new_dat$partial_centered <- new_dat$partial - mean(new_dat$partial)

ggplot(new_dat, aes(x = temp_seq, y = partial_centered)) +
  geom_line() +
  labs(
    x = 'Temperature',
    y = 'Partial Effect',
  )

Code
plot(gam_chl, select = 1)

They’re similar, but the GAM function is more flexible (it also more closely fits our original log_Chlorophyll ~ Temperature plot; this makes sense if Temperature is an influential predictor). We also note that our chi-square goodness of fit test is now significant, meaning this model (with a more correctly modeled Temperature) is a poorer fit, possibly due to missing variables.

This all being said, for practical purposes, the LMM gives us a similar overall picture to the GAM; both are valid if you correctly specify the model structure.

All Smooths

You could also make all variables smooth terms if you suspect other relationships may be non-linear (though for truly linear relationships it’s better not to).

Code
gam_chl <- gam(
  log_Chlorophyll ~
    s(log_Salinity, k = 6) +
    s(log_DissAmmonia, k = 6) +
    s(log_DissOrthophos, k = 6) +
    s(log_Turbidity, k = 6) +
    s(lag_logChlorophyll, k = 6) +
    s(Temperature, k = 6) +
    s(Region, bs = 're'),
  data = df_monthly_subset,
  method = 'REML'
)

gam_amm <- gam(
  log_DissAmmonia ~
    s(lag_logDissAmmonia, k = 6) +
    s(log_Turbidity, k = 6) +
    s(Temperature, k = 6) +
    s(Region, bs = 're'),
  data = df_monthly_subset,
  method = 'REML'
)

psem_gam <- psem(gam_chl, gam_amm, data = df_monthly_subset)
summary(psem_gam, standardize = 'scale')

  |                                                                            
  |                                                                      |   0%

  |                                                                            
  |==================                                                    |  25%

  |                                                                            
  |===================================                                   |  50%

  |                                                                            
  |====================================================                  |  75%

  |                                                                            
  |======================================================================| 100%

Structural Equation Model of psem_gam 

Call:
  log_Chlorophyll ~ s(log_Salinity, k = 6) + s(log_DissAmmonia, k = 6) + s(log_DissOrthophos, k = 6) + s(log_Turbidity, k = 6) + s(lag_logChlorophyll, k = 6) + s(Temperature, k = 6) + s(Region, bs = "re")
  log_DissAmmonia ~ s(lag_logDissAmmonia, k = 6) + s(log_Turbidity, k = 6) + s(Temperature, k = 6) + s(Region, bs = "re")

    AIC
 777.996

---
Tests of directed separation:

                                        Independ.Claim Test.Type     DF
        log_DissAmmonia ~ s(log_Salinity, k = 6) + ...     anova 2.0358
   log_DissAmmonia ~ s(log_DissOrthophos, k = 6) + ...     anova 1.0007
  log_DissAmmonia ~ s(lag_logChlorophyll, k = 6) + ...     anova 2.1390
  log_Chlorophyll ~ s(lag_logDissAmmonia, k = 6) + ...     anova 1.0001
  Crit.Value P.Value  
      3.4922  0.0285 *
      3.8124  0.0517  
      0.8399  0.4843  
      5.0868  0.0248 *

--
Global goodness-of-fit:

Chi-Squared = 14.835 with P-value = 0.033 and on 6.736 degrees of freedom
Fisher's C = 21.886 with P-value = 0.005 and on 8 degrees of freedom

---
Coefficients:

         Response             Predictor Estimate Std.Error     DF Crit.Value
  log_Chlorophyll       s(log_Salinity)        -         - 1.0004     0.0001
  log_Chlorophyll    s(log_DissAmmonia)        -         - 3.0864     3.6556
  log_Chlorophyll  s(log_DissOrthophos)        -         - 1.0001     0.7318
  log_Chlorophyll      s(log_Turbidity)        -         - 1.2119    22.7294
  log_Chlorophyll s(lag_logChlorophyll)        -         - 1.0005    70.6783
  log_Chlorophyll        s(Temperature)        -         - 4.9104     9.6466
  log_Chlorophyll             s(Region)        -         - 4.0000     8.2700
  log_DissAmmonia s(lag_logDissAmmonia)        -         - 4.7537    29.7694
  log_DissAmmonia      s(log_Turbidity)        -         - 3.2500     3.6661
  log_DissAmmonia        s(Temperature)        -         - 1.0031    31.4752
  log_DissAmmonia             s(Region)        -         - 4.0000     5.3363
  P.Value Std.Estimate    
   0.9960            -    
   0.0115            -   *
   0.3929            -    
   0.0000            - ***
   0.0000            - ***
   0.0000            - ***
   0.0000            - ***
   0.0000            - ***
   0.0125            -   *
   0.0000            - ***
   0.0001            - ***

  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05

---
Individual R-squared:

         Response method R.squared
  log_Chlorophyll   none      0.53
  log_DissAmmonia   none      0.58

Our \(R^2\) improved, though our interpretations of significance didn’t change much, which makes sense. The biggest advantage is that, by examining the individual GAMs, we can visualize/interpret any underlying non-linear relationships in a way that LMMs can’t.