Species Distribution Modelling performance and its implication for Sentinel-2-based prediction of invasive Prosopis juliflora in lower Awash River basin, Ethiopia

Species Distribution Modelling (SDM) coupled with freely available multispectral imagery from the Sentinel-2 (S2) satellite contributes greatly to monitoring invasive species. However, few attempts have been made to evaluate the performance of SDMs using S2 spectral bands and, in particular, S2 Radiometric Indices (S2-RIs) and biophysical variables. Hence, this study aimed to evaluate the performance of six commonly used SDMs and one ensemble model for S2-based variables in modelling the current distribution of Prosopis juliflora in the lower Awash River basin, Ethiopia. Thirty-five variables were computed from Sentinel-2B level-2A data, and twelve significant variables were selected from them using the Variance Inflation Factor (VIF). A total of 680 presence and absence points were collected to train and validate the models using the tenfold bootstrap replication approach in the R software “sdm” package. The performance of the models was evaluated using sensitivity, specificity, True Skill Statistics (TSS), kappa coefficient, area under the curve (AUC), and correlation. Our findings demonstrated that, except for bioclim, all machine learning and regression models provided successful predictions. Among the tested models, Random Forest (RF) performed best with 93% TSS and 99% AUC, followed by Boosted Regression Trees (BRT), ensemble, Generalized Additive Model (GAM), Support Vector Machine (SVM), and Generalized Linear Model (GLM) in decreasing order. The relative influence of vegetation indices was the highest, followed by soil indices, biophysical variables, and water indices in decreasing order. According to the RF prediction, 16.14% (1553.5 km²) of the study area was invaded by the alien species. Our results highlight that S2-RIs and biophysical variables combined with machine learning and regression models have a high capacity to model invasive species distribution. Besides, the use of machine learning algorithms such as RF is essential for remote sensing-based invasive SDM.


Introduction
Globally, the spread of invasive alien plant species has become a great risk as it adversely affects ecological services and socio-economic systems (Rajah et al. 2019; Paz-Kagan et al. 2019). Their distribution and the resulting socio-economic loss in East Africa are also increasing at an alarming rate (Landmann et al. 2020). Beyond their current distribution and negative consequences, studies also warn of a large expected increase in their invasion, which will adversely affect uninvaded areas (Howard 2019). Hence, monitoring invasive alien plant species using contemporary technologies before they disperse is of paramount importance (Rajah et al. 2019; West et al. 2016). This is particularly vital for developing countries with little or no financial and technical capability to avert the invasion, where any delay further aggravates the problem (Pyšek et al. 2012; Vilà et al. 2011).
Prosopis juliflora (hereafter Prosopis) is one of the ten worst invasive species and adversely affects millions of hectares of land in many arid and semi-arid regions (Ilukor et al. 2016; Rembold et al. 2015; Shackleton et al. 2014). In Ethiopia, about 1.17 million hectares of land are currently invaded by Prosopis in the Afar region alone, which results in a loss of ecosystem services of approximately 602 million US dollars (mainly due to Prosopis expansion) (Shiferaw et al. 2019b). Its adverse social, economic, and ecological impact in the area is also expected to increase as the species is expanding aggressively at a rate of 8.3% annually (Shiferaw et al. 2019b). Hence, the best and recommended way is to control its invasion early by anticipating suitable habitats for its expansion using timely and cost-effective tools (West et al. 2017).
Mapping the current invasion and modelling the suitable habitat of invasive species help ecologists and policymakers control their expansion and the threat they pose (Ayanu et al. 2014; Evangelista et al. 2008; Feilhauer et al. 2012). However, controlling their spread and managing the consequences requires a robust, cost-effective, efficient, and precise monitoring system (Lopatin et al. 2016). In addition, ecologists need timely and cost-effective methods to model and predict the distribution of invasive species in advance (West et al. 2017). In this regard, Species Distribution Modelling (SDM) and Geographic Information Systems (GIS) are among the most widely used prediction tools (Bradley 2014). SDMs have long been used by ecologists to predict species distribution (Allouche et al. 2006; Jiménez-Valverde 2014; Lemke and Brown 2012; Wisz et al. 2008). However, models have shown varied performance, and no single best model has been identified for different species and environments (Reside et al. 2011).
The selection and implementation of models require great care, as inappropriate use of models can affect the accuracy of species prediction (Elith and Leathwick 2009; Elith et al. 2006). Consequently, many studies have compared more than one model (Früh et al. 2018; Ng et al. 2018; Stohlgren et al. 2010). Besides, some studies recommend presence-absence models, while others favour presence-only models (Elith and Leathwick 2009). Accordingly, some studies compared machine learning algorithms with regression-based models (Shiferaw et al. 2019c), others compared machine learning algorithms with one another (Früh et al. 2018), and still others developed ensemble models from different SDMs rather than relying on a single model (Ng et al. 2018). However, the performance of SDMs depends on the type of application and problem (Bhattacharya 2013), the spatial resolution of environmental variables (Reside et al. 2011), and the selection of environmental predictors (Elith and Leathwick 2009). For this reason, applying several models with the same environmental variables and methodology gives ecologists more confidence in judging their results (Fischer et al. 2013; Lemke and Brown 2012).
The use of machine learning algorithms for remote sensing-based prediction has recently gained attention and has significantly improved the prediction of invasive species (Benito et al. 2013; Früh et al. 2018). A timely, advanced, and cost-effective approach that integrates remote sensing technology to monitor invasive species risk is highly needed (Rajah et al. 2019; West et al. 2017). This is particularly important in arid and semi-arid regions of developing nations, where the cost of survey data and high-resolution commercial data is difficult to justify. The practical option in such areas is the use of freely available multispectral data (Jensen et al. 2020). In this regard, S2 data, owing to its high spatial, spectral, and temporal resolutions, contributes greatly to monitoring the distribution and spread of invasive species (Martinez et al. 2020; Meroni et al. 2017; Ng et al. 2016, 2017; Truong et al. 2017). S2's high spectral resolution allows ecologists to derive numerous indices (Rajah et al. 2019). These indices are preferable to raw bands as they can reduce the effect of atmospheric conditions and soil background on canopy reflectance (Liu et al. 2005). Among the available indices, most studies have employed S2-derived Vegetation Indices (S2VIs). Though S2VIs are highly important, the contribution of other radiometric indices is potentially also very high. For example, soil radiometric indices (Nouri et al. 2018) and biophysical variables have a high capacity for monitoring and managing vegetation changes (Mudereri et al. 2019). Hence, evaluating the contribution of S2-derived RIs is of utmost importance as they incorporate different variables from the soil, water, vegetation, and biophysical characteristics of the area.
So far, a number of studies have evaluated the performance of SDMs for mapping and modelling different invasive species. Stohlgren et al. (2010) compared five individual models for the prediction of invasive species and suggested that an ensemble model significantly improves prediction compared to a single model. Früh et al. (2018) compared the performance of four machine learning models in the prediction of invasive species and recommended an ensemble built from the best-performing models rather than a single model or an ensemble of all models. Likewise, Ng et al. (2018) compared machine learning models and argued that Random Forest (RF) and ensemble models performed better than the other models used in their study. Abdi (2020) compared four machine learning algorithms using S2-derived variables for land cover classification and concluded that the Support Vector Machine (SVM) performed better than the other models. These studies either integrated remote sensing and non-remote sensing datasets (e.g., Ng et al. 2018) or used coarse resolution remote sensing data (Stohlgren et al. 2010). Other studies used S2 data for land cover classification (Abdi 2020) and for mapping and detecting invasive species distribution (Rajah et al. 2019).
In Ethiopia, few studies have been conducted on the mapping and prediction of Prosopis. Wakie et al. (2014) assessed the distribution of invasive Prosopis using Moderate Resolution Imaging Spectroradiometer (MODIS) data and the Maxent model. Besides, Shiferaw et al. (2019c) evaluated the performance of different SDMs using Landsat 8 Operational Land Imager (OLI), climate, and infrastructural data. However, comparative studies of SDMs using high-resolution data in general, and S2-RIs and biophysical variables in particular, for monitoring Prosopis are scarce.
This study, therefore, aims to address the following research gaps and needs: (1) identifying a robust method for modelling remote sensing-based invasive species distribution; and (2) assessing the potential of S2-derived vegetation, soil, and water indices and biophysical variables for modelling invasive species distribution in arid and semi-arid regions of developing countries.

Study area and species
The study was carried out in the lower Awash River basin, Ethiopia. It is located between 40.74° and 41.82° longitude and 10.99° and 12.36° latitude (Fig. 1). It covers an area of 9471.5 km² with elevations ranging from 240 to 1341 m above mean sea level. In addition, 75% (7103.6 km²) of the study area lies in the desert and 25% (2367.9 km²) lies in arid to semi-arid agroecological zones. It is also part of the Great East African Rift Valley system. According to the National Meteorological Agency (NMA) (2020), the mean annual temperature, mean maximum annual temperature, and mean minimum annual temperature at Dubti station between 2002 and 2017 were 28.1 °C, 33.5 °C, and 22.6 °C, respectively. December and July are the coldest and warmest months, respectively. Furthermore, frequent drought and subsequent famine are the two major characteristics of the area (Mulugeta et al. 2019).
Though pastoralism is the dominant way of life, agropastoralism is also practiced in the area. The state-owned Tendaho irrigation project, which covers around 62,500 ha along the lower Awash River basin, supports irrigation-based agriculture (Tadese et al. 2019). Sugarcane, wheat, cotton, maize, and other vegetables are also cultivated in small-scale agriculture. In the Afar region, Prosopis was introduced in the early 1980s for soil and water conservation (Tilahun et al. 2017). This was part of the then government's afforestation initiative to combat drought and desertification (Wakie et al. 2014). Before the invasion of Prosopis, native grasses, forbs, shrubs, and woody plants dominated the area and were an important source of fodder for the locality (Ayanu et al. 2014; Wakie et al. 2014). After the invasion of the species, however, conflict among pastoralists has increased due to resource competition (Ilukor et al. 2016; Mehari 2015; Wakie et al. 2012). The expansion of Prosopis has also affected livestock production, with an expected loss of about 26 million dollars per year in the region (Ilukor et al. 2014).

Method overview
In this study, we evaluated the performance of six commonly used SDMs using S2-RIs. We computed thirty-five variables from vegetation, soil, and water radiometric indices and biophysical variables. Of these, twelve significant variables were selected using the VIF. TSS, AUC, kappa, correlation, sensitivity, and specificity were used to evaluate model performance. Reference data on the presence and absence of the species were collected using a handheld Global Positioning System (GPS). In addition, an ensemble model was developed from the best-performing models (that is, all models except the worst-performing bioclim model). Finally, prediction maps were produced for all individual models and the ensemble model. A graphical overview of the methods used in this study is presented in Fig. 2.

Presence and absence data
Georeferenced in situ data were collected with a handheld GPS from both presence (invaded) and absence (uninvaded) points in the dry season between January and February 2020. In this period, Prosopis is readily distinguished from other tree species as most trees shed their leaves due to water scarcity (Godoy et al. 2011; Meroni et al. 2017; Xu 2014). A total of 680 points were collected using stratified random sampling in 10 m by 10 m plots, matching the spatial resolution of S2 data (Arogoundade et al. 2020). Of all reference points, 30% (205 points) were presence and 70% (475 points) were absence data. This share was chosen considering the previous distribution of Prosopis in the area (Linders et al. 2020). Field data were collected throughout the study area, with a minimum distance of 200 m between points.
To evaluate the spatial autocorrelation among observations, Moran's I was used. We obtained a Moran's Index of 0.28 and a Z-score of 2.45, indicating no apparent spatial clustering among the points (Abdulhafedh 2017). To further reduce spatial autocorrelation among points, we used the "Spatially Rarefy Occurrence Data for SDMs (reduce spatial autocorrelation)" function in the ArcGIS SDM Toolbox. To obtain independent validation statistics, 70% of the collected data were used to train the models while 30% were used to validate them (Engler et al. 2013).
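As an illustration of the autocorrelation check described above, the following is a minimal Python sketch of global Moran's I. It is not the ArcGIS implementation used in the study: the binary distance-band weighting, the `bandwidth` parameter, and the function name `morans_i` are assumptions made for this example, and the Z-score is not computed.

```python
import numpy as np

def morans_i(coords, values, bandwidth):
    """Global Moran's I with binary distance-band weights.

    Assumption for this sketch: w_ij = 1 when 0 < d_ij <= bandwidth, else 0.
    Other weighting schemes (e.g. inverse distance) give different values.
    """
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(values, dtype=float)
    n = x.size
    # Pairwise Euclidean distances between all points
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    w = ((dist > 0) & (dist <= bandwidth)).astype(float)
    z = x - x.mean()          # deviations from the mean
    s0 = w.sum()              # sum of all weights
    return (n / s0) * (z @ w @ z) / (z @ z)

if __name__ == "__main__":
    # Hypothetical points on a line; 1 = presence, 0 = absence
    pts = [(0, 0), (1, 0), (2, 0), (3, 0)]
    print(morans_i(pts, [1, 1, 0, 0], bandwidth=1.0))  # clustered pattern
    print(morans_i(pts, [1, 0, 1, 0], bandwidth=1.0))  # alternating pattern
```

Values near +1 suggest clustering of similar values, values near -1 suggest a dispersed (alternating) pattern, and values near 0 suggest no spatial structure under the chosen weights.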

Satellite image processing and selection of variables
In this study, Sentinel-2B level-2A data were used to produce a range of radiometric indices and biophysical variables. The Sentinel-2B level-2A product provides geometrically and radiometrically corrected images. The data are delivered as Bottom-Of-Atmosphere (BOA) reflectance images converted from the level-1C Top-Of-Atmosphere (TOA) reflectance (Szantoi and Strobl 2019). Its application was tested by Vuolo et al. (2016) and it has been used by different studies (e.g., Arogoundade et al. 2020; Ng et al. 2018) on invasive species distribution. It observes the Earth's surface at 10-m, 20-m, and 60-m spatial resolutions. Sentinel-2B level-2A products acquired between 19 January 2020 and 28 February 2020, concurrent with the field data collection campaign, were downloaded from the European Space Agency (ESA) data portal (https://scihub.copernicus.eu/dhus/#/home).
A total of four scenes were required to cover the study area. Pre-processing, including image mosaicking, resampling to a common grid of 10-m resolution, and subsetting, was carried out using the freely available Sentinel Application Platform (SNAP) 7.0 software. Maps were produced using ArcGIS software. A total of thirty-five variables were considered (Table S1): seventeen from vegetation radiometric indices, eight from soil radiometric indices, five from water radiometric indices, and five from biophysical variables. To select important variables, we used the VIF correlation approach to reduce multicollinearity problems in the R 4.0 software "sdm" package (Naimi and Araújo 2016). VIF has been used by several studies on invasive species distribution as a tool to select important variables (Ng et al. 2018; Zimmermann et al. 2007). Accordingly, for this study, twelve of the thirty-five variables (Table 1), those with values below the 0.7 threshold, were selected (Engler et al. 2013; Zimmermann et al. 2007). Furthermore, the relative importance of variables for all models was computed using the "getVarImp" function in the R software "sdm" package (Naimi and Araújo 2016). Then, the weighted mean value of each variable for each model run was calculated and categorized under vegetation indices, soil radiometric indices, and biophysical variables.
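The variable screening step can be sketched in Python as follows. This is an illustrative approximation, not the R routine used in the study: it assumes that the 0.7 threshold applies to the absolute pairwise correlation and that, within the most correlated pair, the variable with the larger VIF is dropped. The function names `vif` and `select_by_correlation` are hypothetical.

```python
import numpy as np

def vif(X):
    """VIF of each column, taken as the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

def select_by_correlation(X, names, threshold=0.7):
    """Iteratively drop one variable from the most correlated pair.

    Assumed rule for this sketch: while any |r| exceeds `threshold`,
    remove the member of the most correlated pair with the larger VIF.
    """
    X = np.asarray(X, dtype=float)
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
        np.fill_diagonal(corr, 0.0)          # ignore self-correlation
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        if corr[i, j] <= threshold:
            break                            # no pair above the threshold remains
        v = vif(X[:, keep])
        drop = int(i) if v[i] >= v[j] else int(j)
        keep.pop(drop)
    return [names[k] for k in keep]

if __name__ == "__main__":
    # Hypothetical predictors: "b" nearly duplicates "a", "c" is unrelated
    a = np.arange(20, dtype=float)
    b = a + 0.5 * (-1.0) ** np.arange(20)
    c = np.tile([0.0, 2.0, 4.0, 1.0, 3.0], 4)
    print(select_by_correlation(np.column_stack([a, b, c]), ["a", "b", "c"]))
```

Exactly collinear columns make the correlation matrix singular, so a real workflow would need to remove duplicated variables before inverting it.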

Selection of Species Distribution Modelling
Today a large number of modelling methods are available and can be classified as "profile," "regression," and "machine learning" (Hijmans and Elith 2019). This study evaluates the performance of six commonly used models in the area of invasive SDM. The models were selected from machine learning algorithms, regression models, and profile methods for comparison (Table 2). Boosted Regression Trees (BRT), RF, and SVM from machine learning models; Generalized Additive Model (GAM) and Generalized Linear Model (GLM) from regression models; and bioclim from profile methods were considered. One ensemble model from all models, except the least performing bioclim model, was also developed.

Table 1 Description of the predictive variables used in modelling and mapping the current distribution of invasive Prosopis. The variables were categorized under vegetation indices, soil indices, water radiometric indices, and biophysical variables

Variables Short description
Brightness Index (BI): BI is a soil radiometric index that represents the mean of the brightness of satellite images. It is highly associated with soil brightness (Escadafal 1989).
Colour Index (CI): CI is a soil radiometric index that helps to differentiate soils and their development. Higher CI values represent crusted soils and sands while lower CI values indicate a high concentration of carbonates or sulfates (Escadafal 1989).
Canopy Water Content (CWC): CWC is a biophysical variable that quantifies the amount of water in the given area. It is also an essential predictor in the areas of agriculture and forestry (Cernicharo et al. 2013).
Fraction of Absorbed Photosynthetically Active Radiation (FAPAR): FAPAR is an important biophysical variable that indicates the capacity of the vegetation canopy to absorb Photosynthetically Active Radiation (Fensholt et al. 2004).
Leaf Area Index (LAI): LAI is a key biophysical variable that quantifies the amount of leaf area per unit ground surface (Zhang et al. 2003).
MCARI: MCARI is a vegetation radiometric index, responsive to chlorophyll variations (Daughtry et al. 2000).
Modified Normalized Difference Water Index (MNDWI): MNDWI is a water radiometric index promoted to enhance open water features by minimizing the effect of vegetation, soil, and built-up land noises (Xu 2006).
MERIS Terrestrial Chlorophyll Index (MTCI): MTCI is a vegetation radiometric index that estimates the amount of chlorophyll (Dash and Curran 2004).
Redness Index (RI): RI is a soil radiometric index that gives information about soil color variation in a given area. It is an important index to measure soil redness in the arid environment (Mathieu et al. 1998).
Sentinel-2 Red-Edge Position Index (S2REP): S2REP is a vegetation radiometric index that provides information on chlorophyll content and the growth status of vegetation.
Transformed Normalized Difference Vegetation Index (TNDVI): TNDVI is a vegetation radiometric index that shows the amount of green biomass in a pixel. It has a high coefficient of determination and excellent linearity to vegetation cover (Bannari et al. 2002).
Transformed Soil Adjusted Vegetation Index (TSAVI): TSAVI is a soil radiometric index developed to minimize the influence of soil brightness (Baret and Guyot 1991).

Table 2 Predictive models from machine learning (RF, SVM, and BRT), regression (GAM and GLM), and profile (bioclim) methods and their short description and common pieces of literature that have used these models in modelling invasive species distribution

Models Short description Examples
Random Forest (RF): RF (Breiman 2001), a combination of tree predictors, is the most commonly used machine learning algorithm (Abdi 2020). It is an effective method for predicting species richness and density (Kosicki 2020).
Boosted Regression Trees (BRT): Like RF, BRT is based on a combination of a relatively small number of trees to increase the performance of predictive variables (Elith et al. 2008). It also has the capacity to process several predictors at high predictive accuracy (Gu et al. 2019).

Model validation and mapping
To validate the models listed above, we used the bootstrap replication approach in the R 4.0 software "sdm" package developed by Naimi and Araújo (2016). Out of the 680 collected points, 205 (30%) randomly selected points were used as test data to validate the models and the remaining 70% were used to train them. This step was replicated ten times, and the mean values of sensitivity, specificity, TSS, kappa, AUC, and correlation were used to assess the accuracy of the models. The bootstrap replication method has the potential to offer unbiased predictive accuracy with fairly low variance (Harrell et al. 1996; Lima et al. 2019). Besides, the sensitivity-specificity sum maximization approach was used to select the best threshold. This threshold has been recommended as the best approach for the prediction of species distribution (Liu et al. 2005). Binary maps were developed for all models such that pixels above the threshold represented the presence of Prosopis at different levels of invasion and pixels below the threshold indicated its absence. In addition, over-prediction was corrected by clipping the models with a buffered Minimum Convex Polygon (MCP) in the ArcGIS SDM Toolbox. The buffered MCP, as an a posteriori method, reduces over-prediction (Mendes et al. 2020). The ensemble model was evaluated using a weighted mean of all models except the least performing bioclim model. Maps for the ensemble model were further classified into "uninvaded," "low-invasion," "moderate-invasion," and "high-invasion."
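The evaluation statistics and the threshold rule described above can be illustrated with a short Python sketch. It is not the code of the R "sdm" package: the function names, the use of `>=` at the threshold, and the choice of observed probabilities as candidate thresholds are assumptions made for this example.

```python
import numpy as np

def accuracy_metrics(y_true, y_prob, threshold):
    """Sensitivity, specificity, TSS, and Cohen's kappa at a given threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob, dtype=float) >= threshold  # assumed: >= counts as presence
    tp = int(np.sum(y_pred & y_true))
    tn = int(np.sum(~y_pred & ~y_true))
    fp = int(np.sum(y_pred & ~y_true))
    fn = int(np.sum(~y_pred & y_true))
    n = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)   # true presence rate
    specificity = tn / (tn + fp)   # true absence rate
    observed = (tp + tn) / n       # observed agreement
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "tss": sensitivity + specificity - 1.0,
        "kappa": (observed - expected) / (1.0 - expected),
    }

def auc(y_true, y_prob):
    """AUC as the share of presence/absence pairs ranked correctly (ties count 0.5)."""
    y_true = np.asarray(y_true).astype(bool)
    p = np.asarray(y_prob, dtype=float)
    diff = p[y_true][:, None] - p[~y_true][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def max_sss_threshold(y_true, y_prob):
    """Threshold that maximizes sensitivity + specificity."""
    candidates = np.unique(np.asarray(y_prob, dtype=float))
    sums = []
    for t in candidates:
        m = accuracy_metrics(y_true, y_prob, t)
        sums.append(m["sensitivity"] + m["specificity"])
    return float(candidates[int(np.argmax(sums))])

if __name__ == "__main__":
    # Hypothetical test points: 1 = presence, 0 = absence
    obs = [1, 1, 1, 0, 0, 0, 0]
    prob = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1]
    t = max_sss_threshold(obs, prob)
    print(t, accuracy_metrics(obs, prob, t), auc(obs, prob))
```

In a replication scheme such as the one described, these functions would be applied to the held-out points of each of the ten runs and the results averaged.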

Performances of Species Distribution Modelling
The performance of the SDMs under the different evaluation techniques is presented in Table 3. Except for the bioclim model, the accuracy of the models was very high. Machine learning algorithms (RF and BRT) performed better than regression models (GLM and GAM) and the profile (bioclim) method for all evaluation techniques. RF performed best, followed by BRT, the ensemble, GAM, SVM, GLM, and bioclim in decreasing order. In addition, after RF, BRT performed well in AUC and sensitivity, and GAM performed well in kappa and correlation.
Model accuracy can also be evaluated using the receiver operating characteristic (ROC) curve, as it shows the true presence rate (sensitivity) and the true absence rate (specificity). The ROC curves for all models are presented in Fig. 3. Except for bioclim, sensitivity and specificity scores were very high for all models, indicating that both invaded and uninvaded areas were well identified and that the proportion of correctly classified samples was high (Fig. 3).
Furthermore, the ensemble model was used to produce maps showing the invasion of Prosopis at varying levels. The threshold for the ensemble model was 0.47 (Table 3). Pixels below the threshold were considered "uninvaded," and pixels above the threshold were further divided into three classes of "low," "medium," and "high" invasion (Ng et al. 2018). Accordingly, 86.8% of the study area was not invaded by Prosopis. The rest of the study area (13.2%) was invaded at different levels: low (2.5%), medium (3.2%), and high (7.5%) (Fig. 5).
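A weighted-mean ensemble followed by a four-class invasion map can be sketched as below. The text does not state which statistic supplied the weights or where the class breaks above the 0.47 threshold were placed, so both are assumptions here: the weights are arbitrary illustrative numbers, and the range from the threshold to 1 is split into three equal intervals. The function names are hypothetical.

```python
import numpy as np

def weighted_ensemble(predictions, weights):
    """Weighted mean of per-model probability arrays (same shape for every model)."""
    names = sorted(predictions)
    w = np.array([weights[name] for name in names], dtype=float)
    stack = np.stack([np.asarray(predictions[name], dtype=float) for name in names])
    # Normalize weights so they sum to 1, then contract over the model axis
    return np.tensordot(w / w.sum(), stack, axes=1)

def classify_invasion(prob, threshold=0.47):
    """Label pixels as uninvaded / low / medium / high.

    Assumed for this sketch: equal-width classes between `threshold` and 1.0.
    """
    edges = np.linspace(threshold, 1.0, 4)       # threshold, two inner breaks, 1.0
    classes = np.digitize(prob, edges[:-1])      # 0 below threshold, then 1..3
    labels = np.array(["uninvaded", "low", "medium", "high"])
    return labels[classes]

if __name__ == "__main__":
    # Hypothetical probabilities for two pixels and illustrative weights
    preds = {"RF": [0.9, 0.2], "GLM": [0.5, 0.4]}
    ens = weighted_ensemble(preds, {"RF": 3.0, "GLM": 1.0})
    print(ens, classify_invasion(ens))
```

The same functions accept two-dimensional rasters in place of the short lists used in the example.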

Relative contribution of predictor variables
The relative influence of predictors is shown in Fig. 6 and the supporting information (Table S2). The relative influence of a few variables was very high, while other variables were found to be insignificant. The relative influence of vegetation radiometric indices (TNDVI, MCARI, MTCI, and S2REP) for RF, SVM, BRT, GLM, GAM, and bioclim was 83%, 65.75%, 74.35%, 75.5%, 54.85%, and 51.95%, respectively. However, the relative importance of the water radiometric index (MNDWI) was the lowest in all models except the bioclim model (Fig. 6 and Table S2).

Implications of SDMs performance for remote sensing-based prediction
Our study highlights the relative performance of SDMs for remote sensing-based prediction of invasive species distribution. In the present study, the machine learning algorithms (RF, BRT, and SVM) and regression models (GLM and GAM) performed well, while the bioclim model performed worst. The predictions in our study varied from 19.75% (GAM) to 5% (bioclim). This large difference between the models' predictions can affect the monitoring of invasive species. According to González-Ferreras et al. (2016), models with AUC values below 0.7 and above 0.9 are considered "very poor" and "highly accurate," respectively. Besides, models with TSS and kappa values below 0.4 and above 0.8 are considered "poor" and "excellent," respectively. Based on these evaluation techniques, the performance of all models except bioclim was in the "excellent" category, whereas the performance of the bioclim model was in the "very poor" and "poor" categories. The predictions obtained from the above models, except for bioclim, also agreed with previous studies conducted in the area, indicating that SDM performance has important implications for the reliability of predictions. Moreover, numerous studies highlight the higher performance of RF for remote sensing-based prediction of invasive species distribution after evaluating the performance of several SDMs (Jensen et al. 2020; West et al. 2016, 2017). Jensen et al. (2020) compared machine learning algorithms for the prediction of invasive Kudzu vine in the USA using S2 and Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) data. They found a higher performance of RF, neural [...] four SDMs using remote sensing data for the prediction of invasive cheatgrass and concluded that RF performed better than GLM and BRT.
The higher performance of RF is attributed to its ability to avoid overfitting, as it combines several individual trees and votes for the most popular class (Breiman 2001). Owing to its high performance, it is the most flexible and widely applied machine learning algorithm in various fields of study such as land cover classification (Abdi 2020), forest monitoring (Ma et al. 2020), species richness and density (Kosicki 2020), and invasive species modelling (Ng et al. 2018). In addition, its capability for remote sensing-based studies is also immense as it requires minimal time for satellite image classification (Sabat-Tomala et al. 2020). Furthermore, the bioclim model was the worst-performing model in our study, which is consistent with the findings of Hernandez et al. (2006) and Guisan et al. (2007). Except for bioclim, all models worked well for the S2-based prediction of Prosopis distribution and can be used in similar environments.

Fig. 5 Ensemble model produced from RF, SVM, BRT, GLM, and GAM for modelling and mapping of Prosopis habitat suitability distribution. The gray, black, pink, and red colors describe "uninvaded," "low," "medium," and "high" invasion by Prosopis, respectively

Potential of Sentinel-2 for invasive species prediction
Our study indicated that mapping and modelling the distribution of invasive Prosopis using remote sensing-derived variables contribute greatly to the management of the invasive species. Our findings agreed with the government report and previous studies in the area. Shiferaw et al. (2019b) mapped Prosopis distribution using Landsat-8 OLI and in situ survey data with an RF classifier. They found that 15.4% of the study area was invaded by Prosopis. Their result is comparable with our finding of 14.3%.
In addition, our findings are consistent with those of Shiferaw et al. (2019a, b, c) and the report of the Ministry of Livestock and Fisheries (MoLF 2017). Shiferaw et al. (2019a, b, c) found a 12.33% distribution of Prosopis using remote sensing, climate, and infrastructural variables collected in 2017. Applying the 8.3% annual rate of increase (Shiferaw et al. 2019b), its distribution would increase to 15.4% of the region in 2020, which corresponds to a simple (non-compounded) increase over three years: 12.33% × (1 + 3 × 0.083) ≈ 15.4%. This result is also comparable to our finding of 14.3%. Besides, a report by MoLF (2017) described that Prosopis covers 12.6% of the Afar region. Considering the same rate of increase (8.3%), its distribution would increase to 15.75% in 2020, which is also comparable to our findings. On the contrary, Wakie et al. (2014), using remote sensing-derived and topo-climatic variables in Afar, predicted the distribution of Prosopis and concluded that the species covered 3.8%. This low prediction might be due to the low spatial resolution of the MODIS vegetation indices and bioclimatic variables used in their study. Besides, the quality of WorldClim-based bioclimatic variables is uncertain in areas where weather stations are scarce (Deblauwe et al. 2016).

Martinez et al. (2020) and Truong et al. (2017) described the importance of contemporary remotely sensed variables for the prediction of invasive species. Sentinel-1 and S2 data contribute greatly to detecting and mapping invasive species (Rajah et al. 2019). The freely available S2 data, with its high spatial, spectral, and temporal resolutions, provides important information for species-level monitoring and management (Rapinel et al. 2019). Several studies have also used S2-derived variables for mapping and modelling invasive species distribution (Arogoundade et al. 2020; Dube et al. 2020; Ng et al. 2017; Rajah et al. 2019). Dube et al. (2020) compared S2 with Landsat-8 OLI for mapping the distribution of invasive Lantana camara (Verbenaceae) and concluded that the higher performance of S2-derived variables is due to its higher spectral and spatial resolution. Moreover, Arogoundade et al. (2020) recommend the use of S2 spectral bands, S2-derived vegetation indices, and their combination for modelling invasive species. They also pointed out that the S2 red edge bands contribute greatly to mapping and modelling invasive species distribution.

Among S2-derived variables, our study identified the higher influence of vegetation radiometric indices over soil indices and biophysical variables (Fig. 6). Vegetation indices derived from satellite remote sensing data are recognized as a reliable source of information for monitoring vegetation changes (Feilhauer et al. 2013; He et al. 2015; Maschler et al. 2018; Nouri et al. 2018; Teillet et al. 1997). Several studies have also confirmed the higher importance of vegetation indices over other variables such as infrastructure, bio (climate), and remote sensing spectral bands. Shiferaw et al. (2019c) illustrated that the Normalized Difference Vegetation Index (NDVI) had higher importance than other variables for invasive species prediction. Similarly, Wakie et al. (2014) employed remote sensing and topo-climatic variables to map and model invasive Prosopis distribution in the Afar region, Ethiopia, and described that the relative influence of the Enhanced Vegetation Index (EVI) and NDVI was higher than that of other variables. Immitzer et al. (2019) also demonstrated that S2-derived vegetation indices enhanced model performance.

Fig. 6 The relative influence of predictive variables (vegetation indices, soil indices, and biophysical variables) for different models. The blue, red, and gray colors describe the weighted mean influence of vegetation indices, soil indices, and biophysical variables, respectively
Furthermore, a study by Rajah et al. (2019) evaluated S2-based Vegetation Indices (S2-VIs) for detecting and modelling invasive American Bramble (Rubus cuneifolius) and reported that models with S2-VIs performed far better than those with S2-VIs fused with Sentinel-1 Synthetic Aperture Radar (SAR) and S2 optical imagery. Besides, a study by Arogoundade et al. (2020) evaluated the performance of S2-derived vegetation indices for modelling invasive Parthenium hysterophorus in South Africa and confirmed their potential for mapping and modelling invasive species. Moreover, TNDVI and NDVI derived from Landsat-8 OLI showed higher performance for the prediction and description of forest parameters such as density, canopy cover, and basal area (Nouri et al. 2018). Similarly, a study by Musande et al. (2012) evaluated the performance of vegetation indices in discriminating specific crop types and found that TNDVI performed better than the other vegetation indices used in their study. However, TNDVI has a limited ability to identify water areas and may treat them as vegetation cover, which can significantly decrease the accuracy of the model, particularly if the study is conducted in areas with large water bodies (Shetty and Somashekar 2013).
However, several studies employed commonly known vegetation indices, such as NDVI, without evaluating the performance of other vegetation, soil, and water radiometric indices or of biophysical variables (e.g., Arogoundade et al. 2020). Because radiometric indices and biophysical variables differ in their benefits and limitations, using several of them yields more reliable predictions, as the combination incorporates different features such as soil, water, and vegetation. Moreover, the performance of radiometric indices varied among studies. For example, Bannari et al. (2002) described the higher performance of the Transformed Difference Vegetation Index (TDVI) over the widely used SAVI and NDVI. They pointed out that TDVI has higher sensitivity to bare soil below the vegetation cover, which helps to provide the necessary information about specific vegetation parameters. Besides, NDVI carries some uncertainty because it is affected by soil reflectance, and hence the use of other vegetation indices can reduce this problem (Koller and Upadhyaya 2005).
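As a minimal illustration of how some of the indices discussed above differ, the following Python sketch computes NDVI, TNDVI, and SAVI from red and near-infrared reflectance using their commonly cited definitions. The reflectance values are hypothetical and are not taken from this study, the soil adjustment factor of 0.5 is the conventional default rather than a value reported here, and for Sentinel-2 the red and near-infrared inputs would typically be bands B4 and B8.

```python
import math


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        return 0.0  # avoid division by zero for fully dark pixels
    return (nir - red) / total


def tndvi(nir, red):
    """Transformed NDVI: sqrt(NDVI + 0.5), clamped so the root stays real."""
    return math.sqrt(max(ndvi(nir, red) + 0.5, 0.0))


def savi(nir, red, soil_factor=0.5):
    """Soil Adjusted Vegetation Index with soil adjustment factor L (assumed 0.5)."""
    return (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor)


# Hypothetical surface reflectances as (NIR, Red); illustrative values only.
pixels = {
    "dense canopy": (0.45, 0.05),
    "bare soil": (0.30, 0.25),
    "open water": (0.02, 0.04),
}

for name, (nir, red) in pixels.items():
    print(
        f"{name:12s} NDVI={ndvi(nir, red):+.3f} "
        f"TNDVI={tndvi(nir, red):.3f} SAVI={savi(nir, red):+.3f}"
    )
```

With these illustrative values, the water-like pixel has a negative NDVI but a positive TNDVI, because the transformation shifts NDVI by 0.5 before taking the square root.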
Besides vegetation indices, biophysical variables also contributed substantially to monitoring vegetation changes in our study. Mudereri et al. (2019) evaluated the performance of the biophysical variables LAI, FAPAR, CWC, Fraction of Vegetation Cover (FVC), and chlorophyll content (Cab), together with S2 bands, for characterizing land cover in semi-arid regions and concluded that both biophysical variables and S2 wavebands have great capability for land cover classification. They also concluded that FAPAR was the best-performing variable, outperforming the other variables used in their study. Also, biophysical variables have a higher capacity for monitoring and managing vegetation changes. Moreover, biophysical variables derived from satellite remote sensing data contribute substantially to describing forest variables (Schlerf et al. 2005). In particular, S2 data provide an unprecedented option to retrieve biophysical parameters (Brown et al. 2019).
Furthermore, in our study, the relative influence of soil radiometric indices (BI, CI, TSAVI, and RI) was minimal compared to that of vegetation radiometric indices. According to Nouri et al. (2018), lower performance of soil radiometric indices was observed in areas where both low and high vegetation densities occur. Indeed, low species richness and low diversity of woody vegetation are the major characteristics of our study area (Ilukor et al. 2016). Therefore, evaluating the performance of several indices beyond the commonly used ones is necessary for mapping and modelling invasive species distribution (West et al. 2017). However, in our study, the relative influence of TNDVI was far greater than that of the other vegetation and soil radiometric indices. Hence, our results would have benefited from the inclusion of bioclimatic variables (Ahmed et al. 2020). However, acquiring these variables at high resolution was difficult in the study area.

Conclusion
Our study describes the distribution of invasive Prosopis in the lower Awash River basin, Ethiopia, using machine learning (RF, BRT, and SVM), regression (GAM, GLM), and profile (bioclim) methods. We used S2-RIs and biophysical variables as predictors to evaluate the performance of the models. The performance of the machine learning algorithms RF and BRT was very high. The regression models (GAM and GLM) also performed well, ranking next to the RF, BRT, and ensemble models. On the contrary, the performance of the bioclim model was insufficient. Hence, we encourage researchers not to depend heavily on the bioclim model with S2-RIs and biophysical variables for predicting Prosopis distribution in dryland ecosystems. We also encourage researchers to evaluate the performance of models, or to use models previously evaluated in related ecosystems and datasets, before directly employing specific models, because differences in model performance can lead to significantly different predictions. Therefore, the use of several models can provide reliable information and increase the confidence of ecologists in their findings.
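For readers who wish to compare models in the way recommended above, the following Python sketch shows how sensitivity, specificity, TSS, and the kappa coefficient follow from a presence/absence confusion matrix using their standard definitions. The function name and the counts are hypothetical and do not correspond to any model result in this study; the study itself relied on the R "sdm" package for evaluation.

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Threshold-dependent accuracy metrics from confusion-matrix counts.

    tp: presences predicted as presence, fn: presences predicted as absence,
    fp: absences predicted as presence, tn: absences predicted as absence.
    """
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    tss = sensitivity + specificity - 1   # True Skill Statistic

    n = tp + fp + fn + tn
    observed = (tp + tn) / n              # observed agreement
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    kappa = (observed - expected) / (1 - expected)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "tss": tss,
        "kappa": kappa,
    }


# Hypothetical counts for illustration only (100 presences, 100 absences).
metrics = evaluation_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

TSS and kappa both range up to 1 for perfect agreement; TSS does not depend on prevalence, whereas kappa does, which is one reason to report several metrics side by side.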
The best-performing RF model predicted that 1354.6 km2 (14.3%) of the study area was invaded by the species, indicating that more effort is required to reduce its distribution. Our study also demonstrated that freely available S2 data contribute substantially to detecting, mapping, and modelling the spatial distribution of invasive species with a high level of precision. In particular, S2-RIs and biophysical variables can provide basic information about vegetation, soil, and water for better spatial modelling of invasive species. Also, the higher performance of S2-derived variables for mapping and modelling invasive Prosopis distribution indicates that such datasets are adequate for this type of study. Moreover, the relative influence of vegetation radiometric indices was the highest, followed by soil radiometric indices, biophysical variables, and water radiometric indices. We recommend that researchers integrate vegetation indices, soil indices, and biophysical variables when modelling invasive species rather than relying only on commonly known vegetation radiometric indices.