The effects of nutrients on stream invertebrates: a regional estimation by generalized propensity score

Introduction The effects of nutrients on stream conditions within individual streams or small areas have been studied extensively, but the same effects over a large region have rarely been examined due to the difficulty of applying large-scale manipulative experiments. In this study, we estimated the causal effects of nutrients within the Western United States on invertebrate richness, an important biological indicator of stream conditions, by using observational data. Methods We used the generalized propensity score method to avoid the common problem of statistical inference using observational data, i.e., correlation established based on observational data does not imply a causal relationship because the effects of confounding factors are not properly separated. Results Our analysis showed a subsidy-stress relationship between nutrients and invertebrate taxon richness in the whole Western United States and in its sub-ecoregions. The magnitude of the relationship varies among these sub-ecoregions, suggesting a varying nitrogen effect on macroinvertebrates due, in large part, to the varying natural and anthropogenic conditions from ecoregion to ecoregion. Furthermore, our analysis confirmed that causal estimation results using regression can be sensitive to the imbalance of confounding factors. Conclusions Stratifying data into ecoregions with relatively homogeneous environmental conditions or adjusting data by generalized propensity score can improve the balance of confounding factors, thereby allowing more reliable causal inference of nutrient effects. Invertebrates respond to the same nutrient levels differently across different site conditions.


Introduction
Nutrients are essential for maintaining an ecosystem's structure and function. Knowledge of the effects of excessive nutrients on ecosystems is important for environmental management. In streams, increased nutrient concentrations have altered biological structures and functions such as species richness, composition, abundance, and decomposition rate (Dodson et al. 2000;Freeman et al. 2009;Smith et al. 1999;Gulis and Suberkropp 2004;Rosemond et al. 1993). Excessive nutrients can also reduce water quality causing problems for drinking water and can deplete dissolved oxygen, leading to fish kills (USEPA 1996). For example, 40% of rivers in the USA have been impaired primarily as a result of excessive nutrients (USEPA 1996).
Invertebrates occupy an important ecological niche in streams, and their aggregated measures such as total taxon richness are widely used for stream condition assessments (Fore et al. 1996;Moss et al. 1987). However, while a few key invertebrate taxon grazers have been examined in many field and laboratory studies, relatively little work has been done to examine the effects of nutrients on aggregate measures of invertebrate assemblages (e.g., richness) (Yuan 2010; Cross et al. 2006;Quinn et al. 1997). As a result, our current understanding of the causal effects between increased nutrients and invertebrate richness (IR) is still limited.
How responsively does the IR change with nutrients, and does the causal relationship vary regionally? The causal relationship details are necessary to inform management actions and provide proper measures. A causal relationship is ideally quantified using manipulative experiments where treatments are randomly applied to replicated samples. This approach eliminates the effects of confounding factors by adequately resolving the counterfactual problem (Maldonado and Greenland 2002) in a causal analysis. Such manipulative studies (e.g., Cross et al. 2006;Gafner and Robinson 2007;Hart and Robinson 1990;Slavik et al. 2004) have increased our understanding of the effects of nutrients on streams but were usually conducted in a very small area or single whole stream/channel representing limited conditions. As a result, it is difficult to draw general conclusions from those studies for a region (e.g., the regional average effect) for setting regional nutrient criteria. Applying randomized experiments is difficult in the case of many streams across a large landscape. An alternative approach to study regional average effect is to use observational data that have been collected from many streams spanning different conditions/locations, which might produce complemental knowledge to that what we have gained from manipulative studies.
Observational data, by definition, are collected without a random sampling mechanism with respect to the effect of the variable of interest. When observational data are used without properly addressing potential problems induced by the non-random nature of the data collection process (e.g., imbalance of factors other than the variable of interest, i.e., confounding factors), results can be biased (Qian and Harmel 2016). However, rare studies in the ecological literature have addressed this problem of confounding factors, which can lead to divergent results of the same problem. For example, Clenaghan et al. (1998) reported a positive association between nutrients and benthic macroinvertebrates in one catchment in Ireland; Heino et al. (2003) reported the same positive association in a river in Finland; Bergfur et al. (2007) found a negative correlation in streams of central Sweden; Wang et al. (2007) demonstrated a negative association in some wadeable streams in Wisconsin; and Harding et al. (1999) and Niyogi et al. (2007) showed no clear link between macroinvertebrates and nutrient concentrations in one river and in a suite of 21 streams of southern New Zealand, respectively. These divergent results are expected, as these studies focused on local streams and each stream may have different confounding factors (e.g., watershed land use patterns, habitat quality, and flow conditions). Because nutrient criteria are usually developed for a large geographic region and not for individual streams or at a local level, understanding the regional (average) effects of nutrient enrichment is necessary. In this study, we aim to evaluate the regional average effects of nutrients on stream invertebrate taxon richness in the Western United States and its individual ecoregions from observational data.
An effect, or more specifically a causal effect, of a nutrient on invertebrate richness cannot be equated to the correlation between the two variables when the data used are observational data. In observational data, a treatment (i.e., nitrogen concentrations) is "assigned" to each site through some unknown and likely non-random processes. The resulting data are not balanced with respect to confounding factors. In other words, observed values of a confounding factor cannot be the same for streams with different observed treatment levels (nitrogen concentration). This imbalance often leads to a biased estimate of the causal relationship. One statistical approach to this problem is the use of propensity score matching (Rosenbaum and Rubin 1983;Rubin 2006). The propensity score matching approach was designed for binary treatment variables. It has been used for assessing the effectiveness of agricultural conservation practices on nutrient loss (Qian and Harmel 2016). In our study, the treatment variable, nitrogen concentration, is a continuous variable. We use the generalized propensity score method (Hirano and Imbens 2004;Imai and van Dyk 2004), which estimates the causal relationship by averaging out the effects of known confounding factors. The generalized propensity score for continuous treatments is an extension of the well-established and widely used propensity score methodology for binary treatments (Rosenbaum and Rubin 1983) and multivalued treatments (Rubin 2006).
Here, we used the generalized propensity score method of Hirano and Imbens (2004). This method does not presume any specific linear or nonlinear relationship and allows for flexibility. Our specific questions were "How does the invertebrate taxon richness change with increased eutrophication in the Western United States, and does this causal relationship vary by ecoregion?" Answers to these questions can provide baseline scientific information for nutrient criteria development.

Data
We used observational data collected by the USEPA at wadeable stream reaches from 12 western US states (Washington, Montana, North Dakota, South Dakota, Wyoming, Idaho, Oregon, Nevada, Utah, California, Colorado, and Arizona) during the summers of 2000-2002 ( Fig. 1) (Stoddard et al. 2006). Randomness was achieved through a probability-based sample design under the Environmental Monitoring and Assessment Program (EMAP) (Blair 2001). Extensive biological, physical, chemical, and landscape-scale measurements were collected at each sampled site (USEPA 2000), but we only used those related to our study (Table 1). In total, 670 randomly sampled stream sites that had a complete observation of these variables were included in this study. We used total invertebrate taxon richness as our response variable. Invertebrate richness (IR) was measured as the total number of distinct invertebrate taxa observed in each sample.
Nutrient conditions in these streams are represented by total nitrogen (TN, in μg/L). It is used as the treatment variable. Stream periphyton could either be nitrogen (N) or phosphorus (P) limited, but both P and N additions stimulate periphyton growth (Francoeur 2001;Elser et al. 2007). Total P is highly correlated with TN in this dataset (Fig. 2a), and the stoichiometric ratio of N:P in our dataset is mostly below the Redfield ratio ( Fig. 2b), suggesting N limitation. Therefore, we assumed that both TP and TN influenced stream biota and that TN concentrations can represent the effects of both nutrients across this wide range of streams in our study.
We considered 13 important covariates (variables that co-varied with nutrient concentration) as major Fig. 1 The sampled stream sites as well as their associated level I ecoregions defined by USEPA. Three subsets (strata) were extracted for analysis: the Great Plains' streams (blue points), the Northwestern Forested Mountains' streams (red points), and the North American Deserts' streams (green points) confounding factors of the TN-IR dose-response relationship (Table 1). Covariates were identified by examining bivariate scatter plots and by investigating the correlation between TN and each candidate variable. We used variables with statistically significant correlation with TN (p < 0.05). Methods of field collection and variable extraction of the 13 covariates can be found in Yuan (2010) and Stoddard et al. (2006). TN, precipitation, and other chemistry measurements (e.g., chloride ion) were highly skewed and were thus log-transformed (base 10) prior to analysis (Table 1).

Data stratification
When using data from regions where the average natural conditions are systematically different, different TN-IR dose-response models may be produced. We illustrated this by stratifying the data into level I ecoregions of North America (Omernik 1987) (Fig. 1). Our full dataset fell into six different ecoregion categories. However, we only included the three ecoregions (strata) that had a sample size large enough to run our models. Stratum one included streams in the Northwestern Forested Mountains (n = 345), stratum two included all streams within the Great Plains (n = 120), and stratum three included all streams within the North American Deserts (n = 99). These three ecoregions have different natural conditions as well as different levels/types of human activities (Table 2) and are thus good candidates to demonstrate regional divergence.

An overview of statistical methods for causal inference
The fundamental concept in causal inference is the concept of counterfactual, which requires that the responses to treatment and control be measured from the same subject (Maldonado and Greenland 2002). For example, assessing the causal effect of TN ideally requires the quantification of the IR increase/decrease due to the only change in TN, which means that we have to compare potential outcome IR observed at two different levels of TN (e.g., 0.1 mg/L, 0.2 mg/L) under identical conditions (e.g., the same site and the same moment) to avoid any confounding factors. However, only one of the two potential outcomes can be actually observed; therefore, it is the counterfactual. The statistical solution to the counterfactual problem is Fisher's randomized experiments (Fisher 1966), such that the average effect of a treatment can be quantified with a reasonable level of confidence. The propensity score matching method of Rosenbaum and Rubin (1983) is the most widely used method for causal inference with a binary treatment for observational data. The propensity score is defined as the conditional probability of a subject receiving the treatment given all observed covariates and is modeled as a function of the observed covariates. Instead of matching observed units either directly or by using a nearest-neighbor method in multiple dimensions, this propensity score makes matching possible in one dimension. Rosenbaum and Rubin (1983) proved that matching and sub-classifying with propensity scores can sufficiently remove bias due to confounding factors (see Qian and Harmel (2016) for an example of environmental application). Hirano and Imbens (2004) and Imai and van Dyk (2004) extended the propensity score method to include a continuous treatment variable, which they called a generalized propensity score (GPS). Like the propensity score, a GPS is defined as the conditional probability density function of the treatment given the covariates: where f is the probability density function and T is the treatment set, X is the covariate(s), and t∈T and x∈X. Under a weak unconfoundedness assumption (which implies that all important confounding factors are included in the model for deriving the propensity score), Hirano and Imbens (2004) showed that the GPS is a balancing score, which means that sample units with similar propensity scores have similar covariates independent of treatment levels. Hence, grouping sampling units with similar generalized propensity scores is an effective Fig. 2 The correlation between total nitrogen and total phosphorus at log scale (a) and the stoichiometric relationship (b) means of removing or reducing confounding effects (Hirano and Imbens 2004;Imai and van Dyk 2004).

Implementation of the generalized propensity score
We implemented the generalized propensity score method presented in Hirano and Imbens (2004) to both the full dataset and the three strata. First, we used regression to derive the probability distribution of the treatment variable (log(TN)). This is a purely mathematical step. The distribution (f(t|x)) in Eq. (1) is usually developed using an empirical (e.g., linear regression) model: where T represents the treatment (log(TN)), X represents the vector of covariates (Table 1), and ε is the residual term assumed to follow a normal distribution N (0, σ). The regression coefficients and σ are estimated by using maximum likelihood. After fitting the model, the model's residuals σ were used as a measure of how well each observed log(TN) was predicted by the covariates. Probability densities of these residuals are: The generalized propensity score for each observation (i) is the probability density R i = r(T i , X i ) (i.e., the likelihood of receiving treatment T i , given observed covariates X i ), which was used to account for the aggregated effects of all the covariates in Eqs. (4) and (5). We emphasize here that the generalized propensity score is not the predicted treatment outcome, as in Eq. (2), but instead was the probability density of the residuals of the prediction, as in Eq. (3).
The second step is the estimation of the expectation of the outcome IR conditional on the observed covariates and treatment levels. For each observation, we have the calculated generalized propensity score R that represents the effects of all confounding factors and expect that the response variable is a function of both the treatment and confounding factors; therefore, we can use a polynomial regression model to approximate the functional dependency of IR on T and R: where E is the expectation operator. The parameters in Eq. (4) are estimated by ordinary least squares. This process is similar to conventional multiple regression where the covariates are included as predictors. As a result, it is still too early to interpret the regression in Eq. (4) as a causal effect (Hirano and Imbens 2004), because the conditional expectation of the outcomes is still conditional on the observed covariates (e.g., represented by GPS) which are different among all units, and a causal interpretation should compare expected outcomes with the same score but different treatment levels. One approach is to divide samples into groups with similar scores and then apply Eq. (4) to approach causal interpretation by averaging the coefficients from the groups  (Imai and van Dyk 2004). Alternatively, Hirano and Imbens (2004) suggested that the average dose-response model could be estimated by integrating out the aggregated confounding factor represented by R i . Numerically, we can approximate the integration by averaging equation over R i evaluated at a series of treatment levels.
That is, for a given log(TN) concentration t, the average treatment effect is: where u(t) is the average potential outcome at treatment level t,rðt; X i Þ is calculated based on Eq.

Verification of generalized propensity score
The generalized propensity score is a balancing score (Hirano and Imbens 2004;Imai and van Dyk 2004) when the model specification is appropriate. In other words, when observations are grouped into subsets with similar propensity scores, covariates within a subset should be similar among different treatment levels when the model used to derive the scores is appropriate. We use this expectation to verify the adequacy of the model specification. Hirano and Imbens (2004) proposed the use of a standardized difference to measure the balance in two steps. First, data were grouped into a number of categories (e.g., three) based on log TN (the treatment), each with roughly the same sample size. To show the balance in a confounding factor, we compare the means of the confounding factor among the three categories. The comparison can be made using a standardized difference, the difference of the means in the two categories divided by the standard error of the difference. The standard difference is similar to the t statistics in a two-sample t test. This comparison is first carried out directly to the data without GPS adjustment to illustrate the imbalance in the data. After calculating the propensity scores, we then divide the data into subsets of similar propensity scores and repeat the process of comparing confounding factors among three log TN categories within each relatively homogeneous subset of data. That is, in the second step, the data were divided into a number of subsets based on the calculated propensity scores (confounding factors are balanced within each subset), and within each subset, we further divide the data based on the treatment (log TN) to calculate the standardized mean differences of each confounding factor. The weighted averages of these standardized mean differences are known as the GPS-adjusted mean differences (Hirano and Imbens 2004). Weights of each subset are determined by the subset sample sizes. When the standardized differences of a confounding factor are between − 2 and 2, we consider that the confounding factor is balanced. Through comparing the change in the standardized differences calculated from the two steps, we can show the improvement in terms of confounding factor balance due to propensity score.

Comparison with regression without propensity score adjustment
We compared the dose-response estimation from the generalized propensity score to conventional regression models (without using propensity score adjustment). For the entire dataset, the Forested Mountains stratum and the North American Deserts stratum, we fitted generalized additive models (GAM) to compare with the generalized propensity score because of the nonlinear subsidy relationship. The selection of a smoothness parameter impacts the result of GAM. We used the default smoothness parameter in R package mgcv, which is determined based on a cross-validation simulation for optimal predictive features. The relationship was approximately linear for the Great Plains stratum. We thus fitted the simple linear and multiple linear regressions for comparison, which were then compared with the dose-response function estimated by generalized propensity score. 95% confidence zone of estimation from the generalized propensity scores was computed from a 1000 times bootstrap analysis. The statistical software used for all analyses was R version 3.24.

Balancing check
The imbalance of confounding factors is shown by the standardized differences of the means of these variables among data groups with different log TN levels (Tables 3,  4, 5, and 6). The level of imbalance also varies by ecoregion. For example, data for the Great Plains stratum and North American Deserts are roughly balanced without generalized propensity score adjustments (Tables 5 and 6). Adjusting data by GPS score apparently increased the balance of confounding factors, especially for the dataset as a whole (Table 3).

Dose-response relationship
A subsidy-stress relationship was observed between log(TN) and IR in the Western United States (Fig. 3a). The total N concentrations in streams of the 12 states covered a wide range of concentrations (log(TN) from 1.04 to 4.19, or TN from 11 to 15,625 μg/L). Across this wide geographical region, nutrients first affected invertebrate richness positively and then gradually switched to a negative effect, with a breaking point log TN at ca. 1.80 (63.09 μg/L) (Fig. 3b). One stream site with a low total N concentration and the lowest invertebrate richness (Fig. 3a) seemed to be a leveraging data point at the low end of the nitrogen gradient. When the data point is removed, the result did not change significantly (Fig. 4). Additionally, this subsidy-stress relationship was not a result of the polynomial model (Eq. (4)). Using a first order polynomial instead of the second order polynomial as in Eq. (4) resulted in a similar dose-response relationship, shown in Fig. 3a.
For streams in the Northwestern Forested Mountains, a subsidy-stress relationship was observed with an optimal log(TN) similar to the same tipping point from the model for the entire study area (Fig. 5a) but with higher IR at the same nutrient level. For streams in the Great Plains, nutrient levels were high in all streams due to heavy agriculture. As a result, a monotonously negative relationship was observed between nitrogen and invertebrate richness (Fig. 5b). A subsidy-stress relationship The "Grouped by log TN" columns compare confounding factors in the three subsets divided by three continuous intervals of log TN (in brackets), each with about 1/3 of the total samples. The "Adjusted by GPS" columns compare the confounding factors based on GPS in two steps: (1) data were divided based on GPS into six groups and (2) data within each GPS group were further divided into three groups based on the same log TN brackets, and mean differences were calculated. The sample size-weighted averages of these mean differences are shown  Table 3 for explanations was also observed in North American Deserts but with a different optimal N concentration (Fig. 5c). The same negative relationship was present among different ecoregions when nutrient levels were high, suggesting that nitrogen is a stressor to invertebrates when at high concentrations. Among the three ecoregions, streams in the Great Plains had the lowest invertebrate richness when at the same nutrient level; the streams in the Northwestern Forested Mountains had the highest invertebrate richness.

Comparison with direct regressions
At medium to very high N concentrations, the estimation from GAM for the full set of streams was different from the dose-response estimation by the generalized propensity score (Fig. 4). However, among the three sub-ecoregions, the difference between the direct regression and the generalized propensity score (Fig. 5) is smaller than the difference shown in Fig. 4. In the Northwestern Forested Mountains stratum, GAM had fit a subsidy-stress relationship that was only slightly outside the 95% confidence zone of the generalized propensity score estimation when within the middle ranges of N concentration (Fig. 5a). For the North American Deserts stratum, the GAM fit completely within the 95% confidence zone of the generalized propensity score (Fig. 5c). For the Great Plains stratum, the fitted lines from both the simple and multiple linear regressions  closely followed the mean estimation of the generalized propensity score (Fig. 5b).

Discussion
On average, benthic invertebrate richness in streams within the 12 western states of the USA exhibits a subsidy-stress relationship with total nitrogen. This result is consistent with the conclusion that a high level of nutrient concentrations negatively affects invertebrate richness in streams, a conclusion also found in manipulative experiments and observational studies (Tilman 1987;Dodson et al. 2000;Haddad et al. 2000;Wang et al. 2007). Yuan (2010) studied the same data from the EPA by stratifying the data into different groups based on predicted log(TN) concentrations from its covariates instead of the propensity score. The six groups essentially represent different N concentration ranges. Yuan's (2010) observed increases in TN were associated with small increases in IR in the first three groups (low nutrient levels) and a significantly negative correlation between log TN and IR in the last three groups (high nutrient levels), suggesting that there is a subsidy-stress relationship. The average subsidy-stress relationship found over a large region also explains why conflicting results (e.g., no correlation, positive and negative association) have been found in small areas. When streams in a small area have limited nutrients, increases in nutrients can lead to increase in periphyton biomass, which can support a greater diversity of invertebrates (Chetelat et al. 1999); therefore, only positive correlations could be observed. On the other hand, when nutrients are excessive in a stream, the diminished water quality and the depleted oxygen caused by decomposition of periphyton biomass will likely reduce IR (Correll 1998), leading to a negative association between nutrients and IR.
Although the same subsidy-stress relationship was found in the three ecoregions we studied, the variation among regions is also obvious. At a similar nutrient level, streams in different ecoregions have different expected IR (Fig. 5), as nutrient enrichment is only one of the many factors that may influence the diversity of the stream macroinvertebrate community (Wang et al. 2007;Yuan 2010). The generalized propensity score produces an average dose-response function that is controlled by the average condition represented in the data. The three strata in our study have different covariate mean values (Table 2), representing differences in natural conditions and anthropogenic activities; the differences in these conditions and activities contribute to the variability in both the observed nutrient concentrations and the TN-IR relationship. For example, both local-and catchment-scale disturbances can reduce stream macroinvertebrate taxon richness (Ligeiro et al. 2013). The Great Plains ecoregion has the highest percentage of agricultural and urban areas and riparian disturbance. This stratum thus has the lowest IR, even for streams at comparable TN levels with streams in other strata. The characteristics of flow regimes affected by precipitation events are widely recognized as important variables in determining the diversity and community composition in streams (Resh et al. 1988;Poff et al. 1997). Different precipitation patterns in the three ecoregions can produce totally different flow regimes (David 2006) that may change the nutrient-IR relationship (Palardy and Witman 2010). Higher temperatures in the North American Deserts can also increase stream temperature, changing sediments, and dissolved oxygen (Darren et al. 2013) which may increase the excessive nutrient stress to IR (USEPA 1996). Similar effects of other covariates on IR and nutrients may also occur.
The detrimental effects of excess nutrients require controls on nutrient loadings to streams based on Fig. 3 a The regional causal effects of total nitrogen on invertebrate richness in the Western United States. The black solid line and the black dashed lines are the mean estimations by generalized propensity score of the TN-IR dose-response function and the corresponding 95% confidence boundaries, respectively. The gray solid line is the estimation by GAM. b The change rate (slope) of invertebrate richness at different nitrogen levels nutrient effects (Dodds and Welch 2000). Our study provides practical guidance on how to find a scientifically defensible threshold value of TN (e.g., the breaking point log (TN) in Fig. 3b) that will result in no detectable harmful effect on IR. If confounding effects were not approximately controlled, the resulting biased estimate of the causal effect would lead to a biased threshold value. It is therefore important to have reliable estimates of causal effects. The same approach and principle can be applied to other pollutants and ecological indicators. However, as varying natural and anthropogenic conditions exist in different regions, nutrient effects are stream-or region-specific (Fig. 5). As a result, nutrient thresholds for setting environmental standards need to be region/area-specific to increase their value for application. For example, a nationwide nutrient threshold might be higher for some states/provinces but lower for other states/provinces, and the same problem can be extended to different counties within a state/province. However, in practice, the sample size may be limited when the area of targeted regions becomes small, which may increase the uncertainty associated with dose-response estimation. The value of larger-scale studies like the current study can be realized under a Bayesian framework, where an informative prior distribution of the TN effect can be derived (Gelman and Hill 2007). This use of the Bayesian approach is consistent with the interpretation of a Bayesian prior distribution representing the among-site variability (Qian et al. 2015). Local agencies can establish a process whereby they gradually update the dose-response function when new region-specific data are available. This Bayesian updating approach will gradually move from the larger scale average dose-response relationship towards a locally region-specific dose-response relationship.
An important assumption of the generalized propensity score approach is the "weak unconfoundedness" assumption, which states that all important confounding factors are included in the treatment assignment model (i.e., Eqs. (3) and (4)). This assumption suggests that we should collect as many covariates as possible when designing an observational study to guard against missing any important confounding factors. We have identified and included all important confounders available from the wadeable stream assessment dataset, which was created by the EPA (Table 1). However, we may have missed some potentially important confounding factors that were not observed, such as site history and variability of flow velocity. Nevertheless, we have controlled some of the potential covariates and strengthened the estimation of the nutrient's causal effect on invertebrate richness. Many unobserved confounding factors are likely to be correlated with these observed covariates and are thus partially controlled for as well.
When we started this study, we expected that the dose-response relationship estimated by GPS would be different from that estimated by linear regression and GAM. The difference is indeed obvious when comparing the two estimated dose-relationships using the entire dataset (Fig. 4, top panel). However, the differences were not obvious when we repeated the same modeling approaches in the three ecoregion-based strata. When using the entire dataset, confounding factors are highly imbalanced. This imbalance is shown in the large (absolute value) standardized mean differences of all confounding factors (Table 3). Such imbalances are, however, not as pronounced in the data within the three Fig. 4 Comparison between the estimated effects of total nitrogen on invertebrate richness based on the generalized propensity score with/without the "outlier" stream site (the black point). The black solid line and the black dashed lines are the mean estimations and the corresponding 95% confidence boundaries including the outlier, respectively. The gray line is the mean estimation excluding the outlier ecoregion-based strata (Tables 4, 5, and 6). As a result, ecoregion-based dose-response relationships are relatively similar, with or without GPS adjustment. This result highlights the regional differences in the dose-response relationship, which suggests the value of region-specific nutrient criteria.

Conclusions
The regional average of nutrient causal effects on stream invertebrate richness was estimated using observational data, with confounding effects controlled for using the generalized propensity score. The aggregated confounding effects were removed by integration through the single propensity score. We found a subsidy-stress relationship between nutrients and invertebrate taxon richness across streams both in the Western United States and its sub-regions. This same general pattern varies among ecoregions due to the varying natural conditions and anthropogenic activities. The variation demonstrated that invertebrates respond to the same nutrient levels differently across different conditions.