This document summarizes bacteria exceedance prediction models for the Mystic River Recreational Flagging project.
A logistic regression model was fit to each of three locations (Mystic River @ Rt 16 Bridge, Lower Malden River @ Rt 16 Bridge, Shannon Beach @ Upper Mystic Lake). Robust models could not be fit to the other candidate locations (Blessing of the Bay Boathouse, Wedge Pond), either because no good climatic and hydrologic predictor variables were available or because the exceedance frequency was too low.
At each location, the data were split into training (75%) and testing (25%) subsets using stratified random sampling (i.e., each subset contains approximately the same percentage of exceedances). The training subsets were then fed into multiple classification models, including logistic regression (using all variables, and with step-wise feature selection based on AIC), elastic net regression, random forest, gradient boosting, and support vector machine. Using the caret package, each type of model was optimized over its tuning parameters. The variable importance functions of each model were then used to identify which predictors were most common among the different types of models. In general, the step-wise logistic regression model (minimizing AIC) performed very well and yielded the set of strongest predictors at each site. Various combinations of these predictors were then fit to a standard logistic regression model using 10-fold cross-validation repeated 3 times. The final model was based on the set of predictors that performed best.
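The stratified split described above can be sketched as follows. This is an illustrative Python version under stated assumptions; the actual analysis was done in R with caret, and the function and variable names here are hypothetical.

```python
import random

def stratified_split(labels, train_frac=0.75, seed=42):
    """Return (train, test) index lists, preserving class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        # collect and shuffle the indices of this class, then cut at 75%
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(train_frac * len(idx))
        train += idx[:n_train]
        test += idx[n_train:]
    return sorted(train), sorted(test)

# Toy example: 20 samples, 4 exceedances ("yes")
train_idx, test_idx = stratified_split(["yes"] * 4 + ["no"] * 16)
```

Because each class is split separately, both subsets keep roughly the same exceedance rate as the full dataset.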
Each model can be evaluated by various metrics. Descriptions and interpretations of each metric are as follows:
Following the guidance of Francy et al. (2013), model acceptance was based on the following goals:
The following table summarizes the performance of each model on the training datasets. All three models meet each of these criteria.
Model ID | # Samples | % Exceedances | Null Deviance | Model Deviance | No Information Rate | Accuracy | Kappa | ROC AUC | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value |
---|---|---|---|---|---|---|---|---|---|---|---|---|
MALDENLOWER_ECOLI | 237 | 11.39 | 168.10 | 90.73 | 0.89 | 0.91 | 0.59 | 0.92 | 0.70 | 0.94 | 0.59 | 0.96 |
MYSTIC_ECOLI | 138 | 11.59 | 99.02 | 66.26 | 0.88 | 0.91 | 0.55 | 0.89 | 0.62 | 0.94 | 0.59 | 0.95 |
SHANNON_ENT | 232 | 4.31 | 82.45 | 50.77 | 0.96 | 0.94 | 0.41 | 0.92 | 0.60 | 0.95 | 0.35 | 0.98 |
This table shows the same metrics based on the testing subset for each model (deviance metrics, which are not relevant for the testing subset, are excluded). The Mystic and Malden models continue to meet the criteria. The Shannon Beach model, however, performs relatively poorly. The overall frequency of exceedances is much lower at this site than at the others; the testing dataset included only 3 exceedances (out of 77 samples total), so these relatively low performance metrics are to be expected.
Model ID | # Samples | % Exceedances | No Information Rate | Accuracy | Kappa | ROC AUC | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value |
---|---|---|---|---|---|---|---|---|---|---|
MALDENLOWER_ECOLI | 77 | 10.39 | 0.90 | 0.92 | 0.62 | 0.93 | 0.75 | 0.94 | 0.6 | 0.97 |
MYSTIC_ECOLI | 45 | 11.11 | 0.89 | 0.89 | 0.55 | 0.91 | 0.80 | 0.90 | 0.5 | 0.97 |
SHANNON_ENT | 77 | 3.90 | 0.96 | 0.92 | 0.21 | 0.88 | 0.33 | 0.95 | 0.2 | 0.97 |
Number of exceedance (`yes`) and non-exceedance (`no`) samples by project and monitoring location.
## yes no Total
## CSORWM - MWRA177 15 89 104
## RECFLAG - MYR0435 6 73 79
## Total 21 162 183
Dataset was split into training (75%) and testing (25%) subsets.
## yes no Total
## Training 16 122 138
## Testing 5 40 45
## Total 21 162 183
The logistic regression model for the Mystic River is based on 3 predictors:
- `precip_sum_p24hr_lag0hr`: total rainfall (inches) over the 24 hours prior to the sample timestamp
- `precip_sum_p24hr_lag24hr`: total rainfall (inches) between 24 and 48 hours prior to the sample timestamp
- `hours_since_050in_precip_event`: number of hours since the last 0.5-inch rainfall event

Note that the sum `precip_sum_p24hr_lag0hr + precip_sum_p24hr_lag24hr` equals the 48-hour antecedent rainfall. However, splitting that amount into two separate variables resulted in a better model, and the two variables are not strongly correlated.
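The lagged rainfall predictors can be derived from an hourly precipitation series. A minimal Python sketch (the project's actual processing was done in R and may define rainfall events differently; here an "event" is simply any hour with at least the threshold amount):

```python
def lagged_precip_sums(hourly_precip, sample_hour):
    """24-hr rainfall sums ending 0 and 24 hours before the sample.

    hourly_precip: list of hourly totals (inches); index = hour of record.
    """
    lag0 = sum(hourly_precip[max(0, sample_hour - 24):sample_hour])
    lag24 = sum(hourly_precip[max(0, sample_hour - 48):max(0, sample_hour - 24)])
    return lag0, lag24

def hours_since_precip_event(hourly_precip, sample_hour, threshold=0.5):
    """Hours since the most recent hour with >= threshold inches of rain."""
    for h in range(sample_hour - 1, -1, -1):
        if hourly_precip[h] >= threshold:
            return sample_hour - h
    return None  # no qualifying event in the record
```

By construction, `lag0 + lag24` equals the 48-hour antecedent rainfall, matching the note above.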
All predictors were centered and scaled prior to fitting.
Diagnostic output of the logistic regression model:
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7030 0.2280 0.2758 0.3597 2.0812
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.5727 0.3804 6.762 1.36e-11 ***
## precip_sum_p24hr_lag0hr -1.5384 0.3761 -4.090 4.31e-05 ***
## precip_sum_p24hr_lag24hr -0.4526 0.2748 -1.647 0.0996 .
## hours_since_050in_precip_event -0.4421 0.3643 -1.214 0.2249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 99.018 on 137 degrees of freedom
## Residual deviance: 66.262 on 134 degrees of freedom
## AIC: 74.262
##
## Number of Fisher Scoring iterations: 6
This model accounts for 33.1% of the total deviance (pseudo-R^2 value of 0.33).
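This fit statistic can be reproduced directly from the glm output above, assuming McFadden's definition of pseudo-R^2 (one minus the ratio of residual to null deviance):

```python
# Null and residual deviance from the glm summary above
null_dev, resid_dev = 99.018, 66.262

# McFadden's pseudo-R^2: fraction of the null deviance explained by the model
pseudo_r2 = 1 - resid_dev / null_dev
print(round(pseudo_r2, 3))  # 0.331
```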
Relationships between E. coli concentration and value of each predictor.
Data Subset | Area Under Curve (AUC) |
---|---|
Testing | 0.910000 |
Training | 0.892418 |
The logistic regression model predicts the probability of an exceedance. To classify that probability as an exceedance or not, we need to determine an optimal cutoff probability. By default, logistic regression classifiers use 0.5 as the cutoff. However, when the dataset has a large class imbalance (exceedances are far less frequent than non-exceedances), the optimal cutoff probability can be higher or lower than 0.5. One way to determine the optimal cutoff is to compute the accuracy (proportion of correct predictions), specificity (true negative rate), and sensitivity (true positive rate) over a range of cutoff thresholds. The following plots show these three metrics using both the training and testing subsets. The `distance` metric is the combined distance of the sensitivity and specificity from values of 1. The goal is to minimize the distance metric while not sacrificing accuracy. The optimal cutoff for this model was determined to be about 0.2.
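The cutoff search can be sketched as follows. This is an illustrative Python version (the report's analysis used R), taking the distance to be the Euclidean distance from the ideal point where sensitivity = specificity = 1:

```python
import math

def cutoff_metrics(probs, labels, cutoff):
    """Sensitivity, specificity, and distance from the ideal (1, 1) point.

    probs: predicted probability of exceedance; labels: 1 = exceedance.
    """
    pred = [int(p >= cutoff) for p in probs]
    tp = sum(1 for y, yh in zip(labels, pred) if y == 1 and yh == 1)
    fn = sum(1 for y, yh in zip(labels, pred) if y == 1 and yh == 0)
    tn = sum(1 for y, yh in zip(labels, pred) if y == 0 and yh == 0)
    fp = sum(1 for y, yh in zip(labels, pred) if y == 0 and yh == 1)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens, spec, math.hypot(1 - sens, 1 - spec)

def best_cutoff(probs, labels):
    """Cutoff on a 0.01 grid that minimizes the distance metric."""
    grid = [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda c: cutoff_metrics(probs, labels, c)[2])
```

In practice one would also inspect accuracy across the grid, as the text notes, rather than minimizing the distance alone.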
For the training dataset, the following shows various metrics using the confusion matrix (i.e. tabulation of true positive, false negative, true negative, and false positive). The overall accuracy should be greater than the No Information Rate, which is simply the proportion of non-exceedance samples in the dataset. If the accuracy is less than the No Information Rate, then a naive model based solely on the proportion of exceedances and non-exceedances is a better predictor than the model.
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 10 7
## no 6 115
##
## Accuracy : 0.9058
## 95% CI : (0.8443, 0.9489)
## No Information Rate : 0.8841
## P-Value [Acc > NIR] : 0.2594
##
## Kappa : 0.5526
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.62500
## Specificity : 0.94262
## Pos Pred Value : 0.58824
## Neg Pred Value : 0.95041
## Prevalence : 0.11594
## Detection Rate : 0.07246
## Detection Prevalence : 0.12319
## Balanced Accuracy : 0.78381
##
## 'Positive' Class : yes
##
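All of the summary statistics above can be reproduced from the four confusion-matrix cells. A quick Python check (the report's output format is typical of caret's `confusionMatrix`; this sketch only recomputes the point estimates, not the confidence intervals or tests):

```python
def confusion_stats(tp, fn, fp, tn):
    """Summary metrics from confusion-matrix cells (positive class = exceedance)."""
    n = tp + fn + fp + tn
    acc = (tp + tn) / n
    sens = tp / (tp + fn)   # true positive rate
    spec = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)    # positive predictive value
    npv = tn / (tn + fn)    # negative predictive value
    # Cohen's kappa: agreement beyond what chance alone would produce
    pe = ((tp + fn) / n) * ((tp + fp) / n) + ((fp + tn) / n) * ((fn + tn) / n)
    kappa = (acc - pe) / (1 - pe)
    return {"accuracy": acc, "kappa": kappa, "sensitivity": sens,
            "specificity": spec, "ppv": ppv, "npv": npv}

# Cell counts from the training confusion matrix above
stats = confusion_stats(tp=10, fn=6, fp=7, tn=115)
```

Plugging in the training cells reproduces the printed values (accuracy 0.9058, kappa 0.5526, sensitivity 0.625, and so on).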
For the test dataset, the confusion matrix and associated summary metrics are:
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 4 4
## no 1 36
##
## Accuracy : 0.8889
## 95% CI : (0.7595, 0.9629)
## No Information Rate : 0.8889
## P-Value [Acc > NIR] : 0.6162
##
## Kappa : 0.5545
## Mcnemar's Test P-Value : 0.3711
##
## Sensitivity : 0.80000
## Specificity : 0.90000
## Pos Pred Value : 0.50000
## Neg Pred Value : 0.97297
## Prevalence : 0.11111
## Detection Rate : 0.08889
## Detection Prevalence : 0.17778
## Balanced Accuracy : 0.85000
##
## 'Positive' Class : yes
##
During model development, the logistic regression model was fitted to 30 random resamples of the training subset using 10-fold cross-validation repeated 3 times. For each resample, the coefficients were fitted and then ROC statistics computed using a cutoff probability of 0.2. This boxplot shows the distribution of each statistic across the 30 resamples (red circle = mean). This figure gives some indication as to how sensitive the model is to the specific dataset. The final model was fitted to all of the data in the training subset.
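The resampling scheme (10-fold cross-validation repeated 3 times, giving 30 resamples) can be sketched as index generation. This is an illustrative Python version; the actual resampling was presumably handled by caret, and this sketch is not stratified.

```python
import random

def repeated_kfold(n, k=10, repeats=3, seed=1):
    """Yield (train, test) index lists for k-fold CV repeated `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)       # fresh shuffle for each repeat
        for fold in range(k):
            test = set(idx[fold::k])
            yield sorted(set(idx) - test), sorted(test)

# 138 training samples at the Mystic River site -> 30 resamples
splits = list(repeated_kfold(138))
```

Within each repeat, every sample appears in exactly one test fold, so the 30 resamples cover the training subset three times over.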
Number of exceedance (`yes`) and non-exceedance (`no`) samples by project and monitoring location.
## yes no Total
## CSORWM - MWRA176 26 210 236
## RECFLAG - MAR0065 9 69 78
## Total 35 279 314
Dataset was split into training (75%) and testing (25%) subsets.
## yes no Total
## Training 27 210 237
## Testing 8 69 77
## Total 35 279 314
The logistic regression model for the Lower Malden River is based on 5 predictors:
- `precip_sum_p24hr_lag0hr`: total rainfall (inches) over the 24 hours prior to the sample timestamp
- `precip_sum_p24hr_lag24hr`: total rainfall (inches) between 24 and 48 hours prior to the sample timestamp
- `hours_since_100in_precip_event`: number of hours since the last 1.0-inch rainfall event
- `pressure_change_p72hr`: change in barometric pressure (inHg) from 72 hours prior to the sample timestamp (a negative value indicates storms moving in)
- `temp_change_p72hr`: change in air temperature (degF) from 72 hours prior to the sample timestamp

Note that the sum `precip_sum_p24hr_lag0hr + precip_sum_p24hr_lag24hr` equals the 48-hour antecedent rainfall. However, splitting that amount into two separate variables resulted in a better model, and the two variables are not strongly correlated.
All predictors were centered and scaled prior to fitting.
Diagnostic output of the logistic regression model:
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7908 0.1088 0.2001 0.3294 1.5014
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.1521 0.4176 7.549 4.39e-14 ***
## precip_sum_p24hr_lag0hr -1.2656 0.2905 -4.357 1.32e-05 ***
## precip_sum_p24hr_lag24hr -0.7792 0.2425 -3.214 0.00131 **
## hours_since_100in_precip_event -0.6612 0.2669 -2.477 0.01324 *
## pressure_change_p72hr 1.1378 0.3525 3.228 0.00125 **
## temp_change_p72hr 0.7718 0.3494 2.209 0.02717 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 168.100 on 236 degrees of freedom
## Residual deviance: 90.726 on 231 degrees of freedom
## AIC: 102.73
##
## Number of Fisher Scoring iterations: 6
This model accounts for 46% of the total deviance (pseudo-R^2 value of 0.46).
Relationships between E. coli concentration and value of each predictor.
Data Subset | Area Under Curve (AUC) |
---|---|
Testing | 0.9311594 |
Training | 0.9179894 |
The logistic regression model predicts the probability of an exceedance. To classify that probability as an exceedance or not, we need to determine an optimal cutoff probability. By default, logistic regression classifiers use 0.5 as the cutoff. However, when the dataset has a large class imbalance (exceedances are far less frequent than non-exceedances), the optimal cutoff probability can be higher or lower than 0.5. One way to determine the optimal cutoff is to compute the accuracy (proportion of correct predictions), specificity (true negative rate), and sensitivity (true positive rate) over a range of cutoff thresholds. The following plots show these three metrics using both the training and testing subsets. The `distance` metric is the combined distance of the sensitivity and specificity from values of 1. The goal is to minimize the distance metric while not sacrificing accuracy. The optimal cutoff for this model was determined to be about 0.2.
For the training dataset, the following shows various metrics using the confusion matrix (i.e. tabulation of true positive, false negative, true negative, and false positive). The overall accuracy should be greater than the No Information Rate, which is simply the proportion of non-exceedance samples in the dataset. If the accuracy is less than the No Information Rate, then a naive model based solely on the proportion of exceedances and non-exceedances is a better predictor than the model.
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 19 13
## no 8 197
##
## Accuracy : 0.9114
## 95% CI : (0.8677, 0.9443)
## No Information Rate : 0.8861
## P-Value [Acc > NIR] : 0.1285
##
## Kappa : 0.5939
## Mcnemar's Test P-Value : 0.3827
##
## Sensitivity : 0.70370
## Specificity : 0.93810
## Pos Pred Value : 0.59375
## Neg Pred Value : 0.96098
## Prevalence : 0.11392
## Detection Rate : 0.08017
## Detection Prevalence : 0.13502
## Balanced Accuracy : 0.82090
##
## 'Positive' Class : yes
##
For the test dataset, the confusion matrix and associated summary metrics are:
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 6 4
## no 2 65
##
## Accuracy : 0.9221
## 95% CI : (0.8381, 0.9709)
## No Information Rate : 0.8961
## P-Value [Acc > NIR] : 0.2999
##
## Kappa : 0.6232
## Mcnemar's Test P-Value : 0.6831
##
## Sensitivity : 0.75000
## Specificity : 0.94203
## Pos Pred Value : 0.60000
## Neg Pred Value : 0.97015
## Prevalence : 0.10390
## Detection Rate : 0.07792
## Detection Prevalence : 0.12987
## Balanced Accuracy : 0.84601
##
## 'Positive' Class : yes
##
During model development, the logistic regression model was fitted to 30 random resamples of the training subset using 10-fold cross-validation repeated 3 times. For each resample, the coefficients were fitted and then ROC statistics computed using a cutoff probability of 0.2. This boxplot shows the distribution of each statistic across the 30 resamples (red circle = mean). This figure gives some indication as to how sensitive the model is to the specific dataset. The final model was fitted to all of the data in the training subset.
Number of exceedance (`yes`) and non-exceedance (`no`) samples by project and monitoring location.
## yes no Total
## DCRBCH - UPLSHBM 10 224 234
## RECFLAG - UPLSHBC 3 72 75
## Total 13 296 309
Dataset was split into training (75%) and testing (25%) subsets.
## yes no Total
## Training 10 222 232
## Testing 3 74 77
## Total 13 296 309
The logistic regression model for Shannon Beach is based on 1 predictor:
- `aberjona_logflow_p1d`: daily mean streamflow (log-10 transformed) at the Aberjona River gauge (Station ID: `01102500`) on the day before the sample was collected

All predictors were centered and scaled prior to fitting.
Diagnostic output of the logistic regression model:
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.89322 0.05378 0.10094 0.20426 1.39868
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.0049 0.8355 5.99 2.1e-09 ***
## aberjona_logflow_p1d -2.2503 0.5161 -4.36 1.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 82.446 on 231 degrees of freedom
## Residual deviance: 50.768 on 230 degrees of freedom
## AIC: 54.768
##
## Number of Fisher Scoring iterations: 8
This model accounts for 38.4% of the total deviance (pseudo-R^2 value of 0.38).
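With a single predictor, the fitted coefficients above translate directly into a predicted probability via the inverse logit of the linear predictor. A Python sketch under stated assumptions: the input must first be centered and scaled using the training mean and standard deviation (not reported here), and whether the result is the probability of an exceedance or a non-exceedance depends on the factor coding used when fitting (the large positive intercept suggests the latter, since non-exceedances dominate at this site).

```python
import math

def inv_logit(eta):
    """Logistic function: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Fitted coefficients from the glm summary above
b0, b1 = 5.0049, -2.2503

def predicted_prob(z_logflow):
    """z_logflow: centered/scaled log10 daily streamflow (hypothetical input)."""
    return inv_logit(b0 + b1 * z_logflow)
```

At the mean log-flow (z = 0) the fitted probability is about 0.993, and it falls as streamflow increases, consistent with higher flows accompanying exceedances.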
Relationships between Enterococcus concentration and value of each predictor.
Data Subset | Area Under Curve (AUC) |
---|---|
Testing | 0.8828829 |
Training | 0.9247748 |
The logistic regression model predicts the probability of an exceedance. To classify that probability as an exceedance or not, we need to determine an optimal cutoff probability. By default, logistic regression classifiers use 0.5 as the cutoff. However, when the dataset has a large class imbalance (exceedances are far less frequent than non-exceedances), the optimal cutoff probability can be higher or lower than 0.5. One way to determine the optimal cutoff is to compute the accuracy (proportion of correct predictions), specificity (true negative rate), and sensitivity (true positive rate) over a range of cutoff thresholds. The following plots show these three metrics using both the training and testing subsets. The `distance` metric is the combined distance of the sensitivity and specificity from values of 1. The goal is to minimize the distance metric while not sacrificing accuracy. The optimal cutoff for this model was determined to be about 0.2.
For the training dataset, the following shows various metrics using the confusion matrix (i.e. tabulation of true positive, false negative, true negative, and false positive). The overall accuracy should be greater than the No Information Rate, which is simply the proportion of non-exceedance samples in the dataset. If the accuracy is less than the No Information Rate, then a naive model based solely on the proportion of exceedances and non-exceedances is a better predictor than the model.
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 6 11
## no 4 211
##
## Accuracy : 0.9353
## 95% CI : (0.8956, 0.9634)
## No Information Rate : 0.9569
## P-Value [Acc > NIR] : 0.9550
##
## Kappa : 0.4126
## Mcnemar's Test P-Value : 0.1213
##
## Sensitivity : 0.60000
## Specificity : 0.95045
## Pos Pred Value : 0.35294
## Neg Pred Value : 0.98140
## Prevalence : 0.04310
## Detection Rate : 0.02586
## Detection Prevalence : 0.07328
## Balanced Accuracy : 0.77523
##
## 'Positive' Class : yes
##
For the test dataset, the confusion matrix and associated summary metrics are:
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 1 4
## no 2 70
##
## Accuracy : 0.9221
## 95% CI : (0.8381, 0.9709)
## No Information Rate : 0.961
## P-Value [Acc > NIR] : 0.9695
##
## Kappa : 0.2116
## Mcnemar's Test P-Value : 0.6831
##
## Sensitivity : 0.33333
## Specificity : 0.94595
## Pos Pred Value : 0.20000
## Neg Pred Value : 0.97222
## Prevalence : 0.03896
## Detection Rate : 0.01299
## Detection Prevalence : 0.06494
## Balanced Accuracy : 0.63964
##
## 'Positive' Class : yes
##
During model development, the logistic regression model was fitted to 30 random resamples of the training subset using 10-fold cross-validation repeated 3 times. For each resample, the coefficients were fitted and then ROC statistics computed using a cutoff probability of 0.2. This boxplot shows the distribution of each statistic across the 30 resamples (red circle = mean). This figure gives some indication as to how sensitive the model is to the specific dataset. The final model was fitted to all of the data in the training subset.
The sensitivity for this model is relatively low. However, this is to be expected given that only 4.21% of all samples were exceedances.