Machine learning applied to species occurrence and interactions: the missing link in biodiversity assessment and modelling of Antarctic plankton distribution

Abstract

Background

Plankton is the essential ecological category that occupies the lower levels of aquatic trophic networks, and it is a good indicator of environmental change. However, most studies deal with the distribution of single species or taxa and do not take into account the complex real-world biological interactions that govern ecological processes.

Results

This study focused on analyzing Antarctic marine phytoplankton, mesozooplankton, and microzooplankton, examining their biological interactions and co-existences. Field data yielded 1053 biological interaction values, 762 coexistence values, and 15 zero values. Six phytoplankton assemblages and six copepod species were selected based on their abundance and ecological roles. Using 23 environmental descriptors, we modelled the distribution of taxa to accurately represent their occurrences. Sampling was conducted during the 2016–2017 Italian National Antarctic Programme (PNRA) ‘P-ROSE’ project in the East Ross Sea. Machine learning techniques were applied to the occurrence data to generate 48 predictive species distribution maps (SDMs), producing 3D maps for the entire Ross Sea area. These models quantitatively predicted the occurrences of each copepod and phytoplankton assemblage, providing crucial insights into potential variations in biotic and trophic interactions, with significant implications for the management and conservation of Antarctic marine resources. The Receiver Operating Characteristic (ROC) results indicated the highest model efficiency for Cyanophyta (74%) among phytoplankton assemblages and for Paralabidocera antarctica (83%) among copepod communities. The SDMs revealed distinct spatial heterogeneity in the Ross Sea area, with average Relative Index of Occurrence (RIO) values of 0.28 (min: 0; max: 0.65) for phytoplankton assemblages and 0.39 (min: 0; max: 0.71) for copepods.

Conclusion

The results of this study are essential for science-based management of one of the world’s most pristine ecosystems and for addressing potential climate-induced alterations in species interactions. Our study emphasizes the importance of considering biological interactions in planktonic studies, employing open access data and machine learning for measurable and repeatable distribution modelling, and providing crucial ecological insights for informed conservation strategies in the face of environmental change.

Introduction

Among the most uncontaminated places on Earth are the Antarctic continent (Leihy et al. 2020) and the Southern Ocean that surrounds it. Particularly, the Southern Ocean is characterized by cold temperatures, and thanks to the onset of the Antarctic Circumpolar Current and the Antarctic Polar Front, high levels of endemism in a variety of taxa and the evolution of unique physiological adaptations (Peck 2018) have been ensured. Despite the Southern Ocean still being relatively pristine, it is currently under several potential threats, such as commercial fishing, warming, acidification, shifting of ocean fronts (Peel et al. 2019), and reduction of sea ice extent and duration (Chown et al. 2012; Turner et al. 2014). These phenomena may vastly compromise its marine ecosystems, affecting their communities from all points of view (Boyd et al. 2008; Constable et al. 2023).

In an attempt to better preserve these unique ecosystems, the Ross Sea Region Marine Protected Area (RSRMPA) was established on 1 December 2017 (Brooks et al. 2020). Wilderness and protected areas of the world represent the last key areas for maintaining biodiversity and associated ecosystem services on a large scale (Mittermeier et al. 2003; Di Marco et al. 2019). Thanks to their ecological integrity, those areas provide baselines for studying, assessing, and managing potential anthropogenic impacts, either current or future (Cole and Landres 1996). Furthermore, thanks to the “spillover effect”, i.e. the export of species to adjacent, non-protected areas, they ensure a positive ecological effect and cross-habitat movement of species (Gell and Roberts 2003; Hixon et al. 2014; González-Herrero et al. 2023). The RSRMPA was specifically designed to protect and maintain the function and structure of marine ecosystems. Among these are areas known to be important for the life cycles of commercially important species such as the Antarctic toothfish (Dissostichus mawsoni Norman, 1937) (Atkinson 1998; Turner 2004; Parker et al. 2021) and the Antarctic krill (Euphausia superba Dana, 1850). Currently, the RSRMPA is the largest and most productive MPA of the whole Southern Ocean (Arrigo et al. 2008). However, the large geographical scales considered, combined with the objective logistic difficulties of sampling at high latitudes, have made it difficult to study and establish solid baselines. Despite these intrinsic difficulties, it is nonetheless necessary to base any conservation or monitoring effort on robust data. Only this kind of information will guarantee robust estimates of potential future changes in the composition and structure of food webs, up to whole marine ecosystems (Smetacek and Nicol 2005; Clarke and Harris 2003; Arrigo et al. 2002; Fraser et al. 2023).
So far, biodiversity analyses have generally been conditioned by large geographical sampling bias deriving from the nature of sampling efforts, which are typically sparse and patchy (e.g. Peel et al. 2019). Fortunately, innovative techniques such as machine learning (ML) and artificial intelligence (AI) represent powerful tools to increase and refine our understanding of complex non-linear patterns and our predictive power (Humphries et al. 2019). In the case of Antarctic biodiversity, these techniques have already been used to implement Species Distribution Models (SDM) on a variety of taxa, such as copepods (Pinkerton et al. 2010a, b; Grillo et al. 2022), echinoderms (Guillaumot et al. 2016), krill (Lin et al. 2022), fishes (Yates et al. 2019), birds and mammals (Huettmann and Schmid 2014; El-Gabbas et al. 2021). In the context of SDM, the concept of the ecological niche, i.e. the environmental conditions necessary for a species to survive and maintain its populations in a given habitat, is of pivotal importance (Colwell and Rangel 2009). Specifically, the ecological niche can be visualised as a ‘hyper-volume with n-dimensions’, characterised by environmental descriptors that allow for the presence of an organism in that particular habitat (Hutchinson 1957; Steiner et al. 2023; Huettmann et al. 2023).

Modelling niches in polar marine ecosystems is remarkably difficult due to the interplay of several complex environmental drivers. For example, the sea ice extent and seasonality (Atkinson 1998; Atkinson et al. 2004; Norkko et al. 2007; O’Driscoll et al. 2011) condition the release of food substances and regulate the available underwater light (Dayton et al. 1986; Clarke 1988), supporting sympagic communities (Garrison et al. 1986; Guglielmo et al. 2000; Granata et al. 2022; Swadling et al. 2023) and regulating the abundance of key taxa such as the Antarctic krill (e.g. Atkinson et al. 2004). The release of organic matter contained in the sea ice, in particular, fertilizes and regulates phytoplankton blooms that, in turn, trigger the zooplankton swarm in the late Austral spring and summer (Meiners et al. 2012; Schnack-Schiel and Hagen 1995; Smith Jr and Nelson 1990; Alexander and Niebauer 1981; Cau et al. 2021).

The aim of this investigation is to quantitatively explore and assess, for the first time, the possible interspecific relationships that occur between different Antarctic planktonic taxa while also considering marine environmental descriptors. This type of result also identifies species and environmental characteristics that may influence their relative community distributions.

Predictive maps of the main phytoplankton assemblages prevalent in Ross Sea waters were created for the first time, together with maps for copepod species that are common and abundant in the same marine area.

This makes it possible to scale up and clarify the predicted distributions of phytoplankton and mesozooplankton across the whole Ross Sea area, while also identifying species and features that may influence those distributions. To achieve this, we implemented the approach of Breiman (2001) to infer three-dimensional distributions from large-scale predictions, instead of relying only on field data or on model adaptations and distorted, poorly performing parsimonious models (e.g., Guthery 2008; Elith et al. 2006).

Materials and methods

Field data and study area

The study area is a large portion of the Ross Sea, spanning from near the Drygalski Ice Tongue in Terra Nova Bay (TNB) to the Central Basin (Fig. 1). Specimens were obtained in the framework of the XXXIInd Italian National Antarctic Program (PNRA) expedition “P-ROSE” (Plankton biodiversity and functioning of the Ross Sea ecosystems in a changing Southern Ocean, PNRA16_00239, Melchiori 2017). Sampling activities were carried out during the Austral Summer 2017 on the R/V “Italica”. The P-ROSE research aimed at identifying signals and/or response patterns of the planktonic compartment to ongoing climate change. During the expedition, a series of zooplankton and micronekton samplings down to a depth of 200 m were carried out by vertical hauls with a standard WP2 net for quali-quantitative analysis. Various environmental data such as water temperature (°C), salinity (PSU), fluorescence (Chl-a μg/l), oxygen concentration (mg/l), conductivity (S/m), density (kg/m³), potential (θe), depth (m) and pressure (dB) were also recorded at each of the sampling stations with a CTD system. The zooplankton and micronekton samples were fixed in a neutralized 4% formalin solution. Specimens were later transferred to 96% ethanol and are now held in the collections of the Italian National Antarctic Museum [MNA, Section of Genoa (Schiaparelli et al. 2019)]. In addition to the P-ROSE data, further data on other planktonic groups, such as microzooplankton and phytoplankton collected in the Ross Sea sector, were included in this study and submitted for analysis.

Fig. 1

Study area with stations sampled and prediction area

Additional data from the available literature were also digitized and used in the analysis to cover other groups. Phytoplankton data were obtained from Bolinesi et al. (2020) and Cordone et al. (2022), while microzooplankton data were obtained from Monti-Birkenmeier et al. (2022). The individual datasets are described and made available in the supplemental materials. These studies identified planktonic organisms down to the lowest possible taxonomic level. The sampling for these identifications took place during the same Antarctic study campaign at the designated analysis stations. Details are provided in Appendix A.

Data processing

We used the OpenSource R software (version 4.3.2; Windows 10) as well as OpenGIS Quantarctica layers package (Matsuoka et al. 2021) to explore, visualize, and map data as well as model predictions with basemaps. We used the projection of geographic WGS84 in decimal latitude and longitude (6 decimals), for data exploration, and the stereographic Antarctic projection WGS 84/Antarctic Polar Stereographic EPSG: 3031 for map display.

To obtain other values of the environmental descriptors present on the Polar Macroscope Layer, the points with the presence and absence data were superimposed on the layers and the values were extracted from the attribute table using the “extract multiple values in points” (GIS) function. These environmental layers have already been employed in the field of SDM, e.g., Huettmann and Schmid (2014), Koubbi et al. (2014), and Grillo et al. (2022). The final table (“data cube”) obtained at the end of the process was used to model-predict the distribution of copepods and phytoplankton assemblages with machine learning, and for the subsequent modelling and representation in Quantarctica. We employed a point lattice with a resolution of 1 km (Appendix B) to establish a prediction grid. This grid was overlaid and assessed using each of the machine learning methods, leading to the creation of the final prediction surface.

Correlation matrix: field data

To visualize possible biological interactions through correlations between phytoplanktonic, microplanktonic and zooplanktonic taxa, we used abundance data transformed into presence (‘1’) and absence (‘0’) values, together with oceanographic data recorded in situ (such as pressure, conductivity, temperature, fluorescence, salinity, potential, oxygen, density, etc.) and the environmental descriptors listed in Table 1. The analysis was performed with the jSDM package (version 4.3.2) (https://cran.r-project.org/web/packages/jSDM/index.html) (Warton et al. 2015).
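The presence/absence transformation and the idea of a pairwise correlation matrix can be illustrated with a minimal Python sketch. It is not the jSDM joint model used in the study: it swaps in a plain Pearson correlation on the binarised data, and the taxon names, abundance values and function names are hypothetical.

```python
from math import sqrt


def to_presence_absence(abundances):
    """Map raw abundances to 1 (present) or 0 (absent)."""
    return [1 if value > 0 else 0 for value in abundances]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant (correlation undefined).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical abundances at six stations (illustrative values only).
abundance = {
    "taxon_a": [12, 0, 5, 0, 8, 3],
    "taxon_b": [4, 0, 9, 0, 1, 7],
    "taxon_c": [0, 6, 0, 2, 0, 0],
}

presence = {name: to_presence_absence(v) for name, v in abundance.items()}
names = sorted(presence)

# Full pairwise matrix, stored as a dictionary keyed by (row, column).
corr = {(a, b): pearson(presence[a], presence[b]) for a in names for b in names}
```

In this toy table, taxon_a and taxon_b always co-occur (correlation close to 1), while taxon_c appears only where the other two are absent (correlation close to -1).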

Table 1 Environmental descriptor sampled as OpenSource GIS Quantarctica layers, geotiff format

Modelling: predictions

Step 1: To obtain a good and more robust signal extracted from the data cube and its selected models, “data cloning” was carried out, since the higher number of columns vs rows would have inhibited a solid fit. This technique, described by Lele et al. (2007), requires the rows of the matrix to be copied so that the actual model achieves a better fit on the additional data. This is often possible due to the subsampling methods employed in the ML methods used. In this way, the obtained signal can become more clear and robust (see Jiao et al. 2016 for an application and details). Our dataset originally had 61 rows. The structure was repeated (rows copied and pasted) three times, thus obtaining a dataset of 183 rows and 90 columns with 28 predictors.
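The row replication described in Step 1 can be sketched in a few lines of Python. This is only an illustration of the copy-and-stack operation on a placeholder table with the stated shape (61 rows, 90 columns); the function name is hypothetical and the cell values are dummies.

```python
def clone_rows(rows, copies):
    """Return a new table in which the whole block of rows appears `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # Copy each row so that the clones do not share list objects with the original.
    return [list(row) for _ in range(copies) for row in rows]


# Placeholder table with the shape given in the text: 61 rows, 90 columns.
original = [[0.0] * 90 for _ in range(61)]
cloned = clone_rows(original, copies=3)

print(len(cloned), len(cloned[0]))  # prints: 183 90
```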

Step 2: Salford Predictive Modeler 8 (SPM, https://www.minitab.com/en-us/products/spm/) was used to obtain the predicted distributional values.

Four predictive distribution models were created in SPM: TreeNet (TN), RandomForest (RF), CART Decision Tree (CA), and a combined model, i.e., Ensemble (EN), as previously done by other Authors (Meißner et al. 2013; Hardy et al. 2011; Huettmann and Schmid 2014).

The TN, RF, CA, and Ensemble models are widely acknowledged as among the most powerful machine-learning methods in ecology studies (Breiman 2001; Friedman 2002; Hegel et al. 2010; Mi et al. 2017). For more details on TN, RF, and CA in SPM and their performances, we refer readers to the user guide document online (https://www.salford-systems.com/products/spm/userguide), while EN models averaged values of the former three model results.

The validation method used for the selected models is V-fold cross-validation (CV) for TN and CA, while for RF, fraction of cases selected at random was used.

For CV, the entire dataset is used for learning purposes and is then partitioned into ten bins. At each fold in tenfold CV, nine bins are used as a training set and the remaining bin is used as a test set. After all ten folds are completed, the test results from each fold are averaged to get a fair test estimate of the all-data model performance (https://www.minitab.com/en-us/products/spm/user-guides/).
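The tenfold procedure can be sketched as follows in Python. This is a generic illustration and not SPM's internal implementation: the shuffling seed, the way bins are formed and the `fit_and_score` callback are assumptions made for the example.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    bins = [indices[i::k] for i in range(k)]  # k bins of near-equal size
    for i, test in enumerate(bins):
        # Training set: every bin except the one held out for testing.
        train = [idx for j, b in enumerate(bins) if j != i for idx in b]
        yield train, test


def cross_validate(n_rows, fit_and_score, k=10, seed=0):
    """Average a user-supplied test score over the k held-out bins."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_rows, k, seed)]
    return sum(scores) / len(scores)


# Example with the 183-row cloned table mentioned in Step 1.
folds = list(k_fold_indices(183, k=10))
```

Each row is used for testing exactly once and for training in the other nine folds, which is what allows the averaged test score to stand in for the performance of the all-data model.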

The model is using approximately 20% of the data for testing. Final results are reported for both the training and testing data, and the percentage can be modified (https://www.minitab.com/en-us/products/spm/user-guides/).

From the algorithms, a Relative Index of Occurrence (RIO) was obtained for the lattice to show the suitable habitats for the copepods and phytoplankton assemblages. Furthermore, for each model, a single combined depth class (0–200 m) was analysed.

The RIO is a relative index concerning the “occurrence” category and can assume values between low and high, usually between the range of the training data units, e.g., 0 and 1. RIO values for these analysed points were extracted to evaluate how the predictions correspond to the independent field data and for the respective depth assemblage investigated. The accuracy of the selected models is given by the area under the curve of the Relative Operating Characteristic (ROC).
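As an illustration of how an area under the ROC curve can be computed from RIO values and observed presence/absence, the following Python sketch uses the rank-based definition (the probability that a randomly chosen presence point receives a higher score than a randomly chosen absence point). The observations and RIO values are hypothetical, and SPM's own computation may differ in detail.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (presence, absence) pairs in which the presence
    point has the higher score; ties contribute 0.5.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one presence and one absence")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical observations (1 = present, 0 = absent) and predicted RIO values.
observed = [1, 1, 0, 1, 0, 0]
rio = [0.62, 0.48, 0.50, 0.71, 0.20, 0.31]
auc = roc_auc(observed, rio)  # 8 of the 9 presence/absence pairs are ranked correctly
```

A value of 0.5 corresponds to a model that ranks presences and absences no better than chance, and 1.0 to a perfect ranking.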

SPM also produced Variable Importance Plot (VIP) scores, which represent the relative importance of the variables. The importance of the predictor variables in the models is assessed by permuting the predictor variables individually in the test data. The reduction in prediction accuracy is then measured by comparing the models calculated using the permuted data with those obtained from the original data. If model accuracy decreases with the permuted variable, this indicates a strong association of that variable with the response (Liaw and Wiener 2002).
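The permutation logic can be sketched in Python as below. The "model" here is a deliberately trivial threshold rule on the first predictor, standing in for a fitted TN, RF or CA model; the data, function names, number of repeats and seed are all assumptions made for illustration.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, column, repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one predictor column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild the table with only the chosen column permuted.
        permuted = [r[:column] + [v] + r[column + 1:]
                    for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / repeats


# Hypothetical test data: column 0 drives the response, column 1 is noise.
rows = [[0.9, 0.3], [0.8, 0.7], [0.7, 0.1], [0.6, 0.9],
        [0.4, 0.2], [0.3, 0.8], [0.2, 0.4], [0.1, 0.6]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def predict(row):
    """Stand-in model: predicts presence when the first predictor exceeds 0.5."""
    return 1 if row[0] > 0.5 else 0


importance_informative = permutation_importance(predict, rows, labels, column=0)
importance_noise = permutation_importance(predict, rows, labels, column=1)
```

Shuffling the informative column degrades accuracy, whereas shuffling the noise column leaves the predictions unchanged, so its importance is zero.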

Step 3: Correlations between the predicted (generalized) RIO values of copepod species and phytoplankton taxa for the entire study area were analysed. This was initially done in R (https://www.r-project.org/) with the “PerformanceAnalytics” package (version 4.3.2) (https://cran.r-project.org/web/packages/PerformanceAnalytics/index.html).

To obtain large-scale community structure estimates reflecting the study area, we also described correlations in the predicted distribution values from the Ross Sea lattice grid. We repeated this assessment with the “varclus” function in the Hmisc library (version 4.3.2) (https://cran.r-project.org/web/packages/Hmisc/index.html). Subsequently, hierarchical clustering was performed by using the Pearson index (ρ²) as a measure of correlation. Hierarchical clustering was achieved through the use of the “chart.Correlation” function of the PerformanceAnalytics library (version 4.3.2) (https://cran.r-project.org/web/packages/PerformanceAnalytics/index.html). This clustering process was performed with the aim of analysing the RIO values and identifying the ranks formed.

Step 4: The RIO values for the Ensemble maps were obtained by averaging the respective index values (RIO) from the TN, RF and CA models. This resulted in an average RIO, obtained by combining the best ML algorithms available. This lattice point grid was then scored in SPM using the pattern created from the points of the presence/absence matrix. Within this matrix, two values, i.e., “0” for species absence or “1” for species presence, were assigned based on a threshold applied to the RIO predicted for a given point. The accuracy of the selected models is given by the area under the curve of the Relative Operating Characteristic (ROC), following the criteria of Swets (1988) and Pearce and Ferrie (2000).
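The ensemble averaging and the presence/absence thresholding in Step 4 can be sketched in Python as follows. The RIO values are hypothetical, and the threshold of 0.5 is an assumption made for the example, since the text does not state the value that was used.

```python
def ensemble_rio(tn, rf, ca):
    """Point-wise mean of the TN, RF and CA RIO values."""
    if not (len(tn) == len(rf) == len(ca)):
        raise ValueError("all three models must score the same lattice points")
    return [(a + b + c) / 3 for a, b, c in zip(tn, rf, ca)]


def to_presence(rio_values, threshold=0.5):
    """Assign 1 (presence) where the RIO reaches the threshold, else 0."""
    return [1 if value >= threshold else 0 for value in rio_values]


# Hypothetical RIO values at three lattice points.
tn = [0.60, 0.10, 0.45]
rf = [0.70, 0.30, 0.60]
ca = [0.50, 0.20, 0.30]

en = ensemble_rio(tn, rf, ca)   # approximately [0.60, 0.20, 0.45]
predicted = to_presence(en)     # [1, 0, 0] with the assumed threshold of 0.5
```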

Using the Inverse Distance Weighting (IDW) tool present in Quantarctica, 48 predictive surfaces were generated from the scored lattice, showing relative RIOs of zooplankton and phytoplankton over the entire study area (beyond the lattice point locations) in order to create a predictive surface grid. These are the first quantitative and repeatable estimates for the study area, and they are open to assessment and improvement over time.
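For readers unfamiliar with IDW, the interpolation can be sketched in Python as below. This is a generic formulation on planar coordinates; the power parameter of 2 and the sample values are assumptions, and the settings of the GIS tool used in the study are not stated in the text.

```python
from math import hypot


def idw(x, y, samples, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    `samples` is a list of (x, y, value) tuples; nearer samples get
    larger weights, proportional to 1 / distance**power.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in samples:
        distance = hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # exactly on a sample point: return its value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical scored lattice points: (x in km, y in km, RIO).
lattice = [(0.0, 0.0, 0.2), (2.0, 0.0, 0.6)]
midpoint = idw(1.0, 0.0, lattice)   # equidistant from both points: about 0.4
near_first = idw(0.5, 0.0, lattice)  # closer to the first point, so nearer 0.2
```

Because the estimate is a weighted mean, interpolated values always stay within the range of the sampled RIO values.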

In detail, six assemblages of phytoplankton and six copepods (Table 2) were chosen to obtain a more specific view about the plankton community. Four filter feeders (Paralabidocera antarctica (Thompson I.C., 1898), Calanoides acutus (Giesbrecht, 1902), Metridia gerlachei Giesbrecht, 1902, Ctenocalanus citer Heron & Bowman, 1971) (Hoshiai et al. 1987; Michels and Schnack-Schiel 2005), one ambush feeder Oithona similis Claus, 1866 (Kiørboe et al. 2009), and one predator (Paraeuchaeta exigua (Park 1994)) (Michels and Schnack-Schiel 2005) were analysed for copepods, while for phytoplankton we used Chlorophyta, Cryptophyta, Dinophyceae, Prymnesiophyceae, Bacillariophyceae and Cyanophyta. Appendix I includes ISO-compliant metadata, ensuring that the provided data adheres to globally recognized standards for consistency, interoperability, and usability across various research and data management systems.

Table 2 Taxa list

Results

Training field data

First, correlations between plankton taxa were analysed using the field data. Figure 2 shows the graduated scale of correlation values that represent a proxy for biological interactions, with the intensity of colours in the correlation plot representing negative (blue, coexistence) and positive (red, biological interaction) interspecific interactions (Faust and Raes 2012). A total of 1830 values (excluding the diagonal values of 1) were obtained, divided into: 1053 positive values (min = 0.01; max = 0.90), 762 negative values (min = − 0.01; max = − 0.83) and 15 zero values. Phytoplankton and microzooplankton show numerous negative values, particularly with zooplankton categories encompassing suspension feeders, filter feeders, and microplankton.

Fig. 2

Correlation matrix in plankton community of Ross Sea sector estimated under co-occurrence model with marine environmental descriptors

Primary producers, such as phytoplankton, show negative values towards the omnivorous and carnivorous zooplankton and the microzooplankton community. Within the various phytoplankton assemblages, some positive values can be seen, such as that between Chlorophyta and Cryptophyta (0.90), while the most negative value is between Haptophyceae and the zooplankter Paralabidocera antarctica (− 0.75).

In the microzooplanktonic community, we observe positive values between the genera Codonellopsis sp. and Laackmanniella sp. (0.86), whereas the lowest correlation value was found between Laackmanniella sp., Dactyliosolen sp., Cymatocylis sp. and Codonellopsis sp. on one side and Pseudo-nitzschia spp. and Thalassiosira spp. on the other (− 0.83). Zooplankton show many cases of positive and negative correlations. In zooplankton, the highest value is found between Ctenocalanus citer and Cyanophyta (0.76), while the lowest value is between the genus Calanoides sp. and Cyanophyta (− 0.80). All correlation values and details can be found in Appendix C.

Predictions

We were able to compile a value-added data cube, explicitly in time and space, consisting of copepod species, phytoplankton assemblages and environmental descriptors to be used for model predictions of distributions for the wider Ross Sea wilderness area.

Most of the obtained ROCs (Table 3) showed an accuracy higher than 60%. This means that the analysed models perform with moderate accuracy and leave room for improvement. In particular, the highest ROCs occurred in the copepods Paralabidocera antarctica (86%, CART; 82%, TreeNet) and Paraeuchaeta exigua (81%, TreeNet), while in the phytoplankton assemblages they occurred in Cryptophyta (77%, CART) and Cyanophyta (76%, CART; 74%, TreeNet). Meanwhile, the lowest ROC values in copepods occurred in Calanoides acutus, Ctenocalanus citer, Oithona similis, and Metridia gerlachei (27%, RandomForest). In the phytoplankton assemblages, the lowest value was obtained for Chlorophyta (54%, RandomForest). The latter findings await more study. Further details are given in Appendix D.

Table 3 Species list with sample size and ROC accuracy for both models

The importance of variables was analysed for each taxon and model; these scores represent the relative importance of the characteristics and thus help to rank the variables. The ranking of variable importance shows how strongly each variable contributes as a predictor in the models considered. Here we describe the variables influencing Bacillariophyceae and P. antarctica (Fig. 3). Detailed analyses for each of the other taxa considered can be found in Appendix E.

Fig. 3

Variable importance plots for predictive environmental descriptors from CA, RF and TN to predict the presence of Bacillariophyceae assemblages and C. acutus

Concerning Paralabidocera antarctica in the CA model, significant variables include PRESSURE with the highest score of 100%, followed by SEAICE (69.80%), DECIMALLATITUDE (68.25%), and PHOSPHATE_50_M_ (55.98%). In the RF model, we find a broader list of relevant variables, including DECIMALLONGITUDE (13.58%), TEMPERATURE (12.41%), POTENTIAL (10.83%), and CONDUCTIVITY (10.19%). Lastly, in the TN model, there is a more extensive list of relevant variables, including CONDUCTIVITY (56.63%), SEAICE (51.83%), DECIMALLONGITUDE (100.00%), DECIMALLATITUDE (91.04%), and OXYGEN_MG_L (36.17%).

Concerning Bacillariophyceae, in both the CA and TN models, we find several significant environmental descriptors. For example, in CA, we have NOX (100.00%), POX (93.43%), NOX_200 (93.43%), NOX_SURFACE (93.43%), SILICATE_200 (93.43%), SILICATE_SURFACE (93.43%) and PHOSPHATE_SURFACE (92.90%). In TN, we find SEAICE (100.00%), DECIMALLONGITUDE (84.63%), SLOPE (76.01%), SALINITY (73.39%), FLUORESCENCE (72.96%), PRESSURE (70.68%) and OXYGEN_MG_L (69.19%).

Based on the predictions generalized for the study area, Fig. 3 shows several types of trends between copepod species and phytoplankton assemblages, supporting the view that these organisms prefer different Ross Sea zones. Based on the training data and ML with GIS layers, the use of the RIO enabled us to generalize and categorize the different Ross Sea areas according to their occupation by copepod species.

In Appendix F (Fig. 1), the different correlation values between the RIOs of phytoplankton and copepods are shown. The Prymnesiophyceae assemblage shows high correlation values with Dinophyceae (0.99) and Bacillariophyceae (0.91), followed by Dinophyceae, which show high correlation values with Bacillariophyceae (0.91) and Chlorophyta (0.78). The lowest values are recorded for the Cyanophyta assemblage, with Cryptophyta showing a value of − 0.43, while Bacillariophyceae show a slightly higher value of − 0.41.

For copepods, the highest correlation values between the predicted distributions were among O. similis, C. citer, M. gerlachei and C. acutus (1.00). P. antarctica shows negative correlations with almost all species, except with P. exigua, which shows a value of 0.25. P. exigua correlates with a value of 0.01. Looking at the correlation values between the two planktonic groups, we see different scenarios. The highest values occur between the Cyanophyta and O. similis, C. citer, P. antarctica and C. acutus (0.62), whereas the lowest value is between the same copepod species and the Bacillariophyceae assemblage (− 0.079).

The hierarchical clustering of the predicted distributions in the Ross Sea area for the 0–200 m depth class (Fig. 4) shows that the plankton groups have different correlation values. It can be noted that Calanoides acutus, Metridia gerlachei, Oithona similis and Ctenocalanus citer group together. Other groups with good correlation are Bacillariophyceae and Paraeuchaeta exigua, Dinophyceae and Prymnesiophyceae, and Chlorophyta and Cryptophyta. This means they are found as an ecological community in the study areas, whereas Paralabidocera antarctica and Cyanophyta are less correlated.

Fig. 4

Correlations of copepods and assemblages of phytoplankton based on RIO values (Algorithm Ensemble)

Below are just two predictive maps with the corresponding RIO values for the Bacillariophyceae and the herbivorous copepod Paralabidocera antarctica. These two taxa were selected because they are fundamental to the Antarctic marine ecosystem and are considered key species. Consequently, the results of the models with high ROC performance are shown.

A total of 48 predictive distribution maps were generated for various species and phytoplankton assemblages, with details on copepods provided in Appendix G. All RIO values are reported in Appendix H.

Fig. 5 shows that Bacillariophyceae, a significant phytoplankton assemblage, were observed at nearly all sampled stations. Concerning habitat suitability, the RIO displays high values (0.6) across the entire study area. The lowest RIO values (0.49) are identified in the north-eastern part of the sub-Antarctic belt and a few coastal areas. An additional noteworthy observation is found in marine protected areas, where Bacillariophyceae exhibit an increase in RIO values.

Fig. 5

Presence/absence points of raw survey location showing a predicted lattice grid summer distribution using the RandomForest algorithm for the depth class 0–200 m of the Bacillariophyceae. For details, see legend

Other assemblages like Dinophyceae and Prymnesiophyceae (Appendix G: Figs. S17–S20 and S21–S24) present high values of presence in the whole area of the Ross Sea but show average low values (0.30) in the neritic zone. In Marine Protected Areas they show high presence values. Chlorophyta and Cryptophyta (Appendix G: Figs. S5–S8 and S9–S12) have average high RIO values (0.30) throughout the Ross Sea area. In general, Chlorophyta show high values in the sub-Antarctic pelagic zone, while Cryptophyta have high values in the sub-Antarctic belt (Appendix G: Figs. S9–S12). Cyanophyta (Appendix G: Figs. S13 and S16) show high predicted values in the coastal and central areas of the Ross Sea, and low RIO values in the sub-Antarctic area.

Figure 6 shows the distribution of Paralabidocera antarctica, a herbivorous copepod. The observed distribution is concentrated in four coastal stations. The predicted distribution shows medium–high RIO values (0.37) throughout the Ross Sea area. The lowest values (0.33) can be found in the middle of the Ross Sea area. Furthermore, in all marine protected areas, the RIO has medium values (0.35). M. gerlachei, C. citer, O. similis and C. acutus (Appendix G: Figs. S25, S29, S37, S41) show medium–high RIO values (0.5–0.6) throughout the Ross Sea area. The lowest values (0.49) can be found, spot-like, in the northwest of the study area. Furthermore, in all marine protected areas, the RIO has medium values (0.54). Paraeuchaeta exigua (Appendix G: Fig. S42) has high RIO values over the entire Ross Sea area, with low values in the marine protected area RS-GPZi.

Fig. 6

Presence/absence points of raw survey location showing a predicted lattice grid summer distribution using the CART algorithm for the depth class 0–200 m of the copepod Paralabidocera antarctica. For details, see legend

The boxplots for RIOs (Fig. 7) showed that RF for the phytoplankton assemblages and TN for copepods performed better than the other models analysed.

Fig. 7

Quality assessment of the models. a Boxplots of relative occurrence index (RIO) values for three models and the Ensemble model for Antarctic phytoplankton assemblages and the analysed copepod community; b evaluation of the quality of RIO values based on ROC

In phytoplankton assemblages, the model with the lowest RIO values was CA (min = 0, max = 0.16, mean = 0.04), followed by EN (min = 0.12, max = 0.38, mean = 0.27) and TN (min = 0.02, max = 0.58, mean = 0.27), while the RF model showed higher values than the other three models (min = 0.32, max = 0.64, mean = 0.51).

Regarding copepods, the model with the lowest RIO values was once again CA (min = 0, max = 0.24, mean = 0.19), followed by EN (min = 0.12, max = 0.50, mean = 0.39) and RF (min = 0.32, max = 0.66, mean = 0.52). However, the TN model (min = 0.04, max = 0.71, mean = 0.46) showed both the highest maximum of all four models and a minimum nearly as low as that predicted by the CA model.

Discussion

In Antarctic pelagic and coastal ecosystems, seasonal trends in species abundances and diversity are primarily influenced by ice cover duration, seawater temperature, dissolved nutrients, and atmospheric conditions (Pane et al. 2004; Ballard et al. 2012; Cecchetto et al. 2021; Zwerschke et al. 2022). These factors serve as optimal environmental descriptors for species distribution modelling and therefore have been chosen as predictors.

Despite the availability of distributional data for Antarctic plankton throughout the water column, modelling has so far been restricted to surface communities (e.g., Pinkerton et al. 2010b; Alvarez and Orgeira 2022; Lin et al. 2022), often sampled with the Continuous Plankton Recorder (CPR). This sampling method has provided most of the quantitative data available and has made it possible to cover large portions of the Southern Ocean. Unfortunately, its output is restricted to the surface layer of the water column.

To our knowledge, no prior attempts have been made to model and generalize the three-dimensional distribution of plankton throughout the water column. In this study, we used phytoplankton and copepod distribution data down to a depth of 200 m, along with corresponding Open Access environmental data, to predict the distributions of these groups across the Ross Sea during the Austral summer months.

Before modelling, and here for the first time, we took into account the biological interactions that may exist within the Antarctic phytoplankton, microzooplankton and mesozooplankton communities, together with the environmental descriptors of the surrounding environment. The aim was to model something closer to “real world” data rather than an artificially sliced, predefined subset of species, interactions and variables.

Finally, we made predictions on likely distributions for selected key Antarctic taxa belonging to different trophic groups.

Correlation matrix in plankton community

The first step was to gain an understanding of the existing correlations among different planktonic taxa. Our analyses (Fig. 2) showed three types of interactions: positive (coexistence with other species and/or possible facilitating environmental conditions), neutral (i.e. unaffected by the presence of another species and environmental variables) and negative (i.e. antagonistic with other species and/or possible detrimental environmental conditions) (Morales-Castilla et al. 2015). In our case, negative correlation values are observed for all zooplanktonic taxa which develop their trophic niche under similar environmental characteristics, regardless of their trophic ecology, thus encompassing predators, ambush feeders, suspension feeders, and filter feeders. Conversely, some taxa exhibit positive correlations, indicating the possibility of coexistence situations, as seen in the case of appendicularians interacting (i.e. coexisting) with other filter-feeding organisms. Appendicularians, functioning as specialized microfiltrators (Conley et al. 2018), “by-pass” food competition with other filter-feeding species thanks to their unique mucous feeding apparatus (Katija et al. 2017). Instead, in planktonic taxa occupying the lowest trophic levels, such as phytoplankton and microplankton, negative correlations emerge. This analysis offers a first but crucial glimpse of the potential biological correlations that might exist between planktonic taxa. Through these results, it is then possible to estimate how these taxa are potentially distributed, net of environmental descriptors and biological interactions, thus giving a glimpse of the realistic distributions of those species that sustain the ecological processes in a given sector.
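The three-way reading of the correlation matrix described above can be illustrated with a minimal sketch. It assumes a Spearman rank correlation between the abundances of two taxa across stations; the abundance vectors and the 0.3 cut-off separating "neutral" from "positive" or "negative" are hypothetical and are not taken from the study.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def classify(rho, threshold=0.3):
    """Label a correlation; the threshold is an illustrative assumption."""
    if rho > threshold:
        return "positive"
    if rho < -threshold:
        return "negative"
    return "neutral"


# Hypothetical abundances of three taxa at five stations
taxon_a = [1, 2, 3, 4, 5]
taxon_b = [9, 7, 6, 3, 1]
taxon_c = [2, 5, 1, 4, 3]

print(classify(spearman(taxon_a, taxon_b)))
print(classify(spearman(taxon_a, taxon_c)))
```

A correlation of this kind only signals co-variation across stations; as the text notes, it may reflect shared environmental conditions as well as a direct biological interaction.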

VIPs and hierarchical clustering analysis

The hierarchical clustering analysis based on Spearman’s correlation coefficient (Fig. 4) provided important insights into the relationship between the different ecological niches occupied by the phytoplankton assemblages and copepod species analysed. Chlorophyta, Cryptophyta, Dinophyceae, Prymnesiophyceae, Bacillariophyceae and Cyanophyta were selected because of their contribution to the abundance of Antarctic phytoplankton communities (Biggs et al. 2019) and because they play a significant role in food web dynamics, biogeochemical cycling and trophic carbon transfer in the marine environment (Finkel et al. 2010; Marañón 2015). The phytoplankton clustering showed the presence of four distinct groups.

In the first group, the co-presence of Chlorophyta and Cryptophyta suggests significant similarities in the ecological niches occupied. These are mainly influenced by various oceanographic processes (see Fig. 3 and Appendix E), including the stabilisation of the upper mixed layer, sea ice melt and the development of frontal systems (Rodriguez et al. 2002). Similarly, the second group, comprising Dinophyceae and Prymnesiophyceae, also shows similarities. These two assemblages are of particular ecological importance because sea ice variations directly influence their cycles, with melting triggering their summer blooms (Selz et al. 2018; Anderson et al. 2018; Stoecker and Lavrentyev 2018). In addition, Prymnesiophyceae play a key role in the trophodynamics of the marine food web and the biogeochemical cycles of sulphur and carbon (Vancoppenolle et al. 2013). Although the ecological niche of the Bacillariophyceae is similar to those of the Prymnesiophyceae and Dinophyceae, there are important distinctions that place this assemblage in a third, distinct group. In fact, Bacillariophyceae are also influenced by depth, light exposure and the availability of dissolved nutrients, while Prymnesiophyceae are influenced mostly by the stabilisation of the upper mixed layer through ice (Rodriguez et al. 2002). It is relevant to emphasise that the Prymnesiophyceae and Bacillariophyceae show significant abundance peaks during the Austral summer as a function of depth (Nuccio et al. 2000). Bacillariophyceae tend to be more coastal, while Prymnesiophyceae are more pelagic (DiTullio et al. 2000; Arrigo et al. 2000). The Cyanophyta stand widely apart from the three other phytoplanktonic groups because of major differences in niche. This group generally inhabits shallow marine environments (Kosiba and Krztoń 2022), and its abundance depends on dissolved nutrients, water temperature and sea province (Karl and Bird 1993; Acosta Pomar et al. 2000).

A similar analysis can be applied to the copepod fauna (Fig. 4). The filter feeders (C. acutus, M. gerlachei and C. citer) (Michels and Schnack-Schiel 2005) as well as the ambush feeder (O. similis) (Boxshall and Halsey 2004) are grouped into a single cluster. These species compete for the same kind of prey and for suspended organic matter, and they coexist within the same ranges of the environmental variables analysed for their ecological niche (see Fig. 3 and Appendix E). Paralabidocera antarctica, instead, stands apart from the other trophic groups because, in addition to having an epipelagic ecological niche, it feeds mainly on sympagic algae (Hoshiai et al. 1987; Swadling and Gibson 2000). These trophic-ecological characteristics separate this species from the other omnivorous copepod species. P. exigua forms a stand-alone group because of its trophic guild: it belongs to the carnivorous predators (Henschke et al. 2015). It is a eurybathic species, but other factors such as temperature, salinity and oxygen availability may also influence its distribution and adaptation in the environment (Hagen et al. 1993; Carli et al. 2000; Hunt and Swadling 2021). In general, the genus Paraeuchaeta is an important link in the Antarctic food chain because, due to the large size of the individuals, it is also preyed upon by secondary consumers such as birds (Bocher 2002).
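The grouping logic discussed in this section can be sketched as an agglomerative clustering on a correlation-based distance. The sketch below assumes the distance 1 − ρ and average linkage; neither choice is stated in the text, and the taxa labels (A to E) and correlation values are hypothetical placeholders rather than values from Fig. 4.

```python
def average_linkage(names, dist, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(names))]

    def cluster_distance(a, b):
        # Average linkage: mean pairwise distance between members
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(names[i] for i in c) for c in clusters]


# Hypothetical Spearman correlations between five taxa (symmetric matrix)
names = ["A", "B", "C", "D", "E"]
rho = [
    [1.0, 0.9, 0.2, 0.1, -0.3],
    [0.9, 1.0, 0.3, 0.2, -0.2],
    [0.2, 0.3, 1.0, 0.8, -0.1],
    [0.1, 0.2, 0.8, 1.0, 0.0],
    [-0.3, -0.2, -0.1, 0.0, 1.0],
]

# Convert correlation to distance: strongly correlated taxa are close
dist = [[1.0 - r for r in row] for row in rho]

groups = average_linkage(names, dist, n_clusters=3)
print(groups)
```

In this toy matrix, taxa with similar correlation profiles merge first, while a taxon that correlates weakly with all others remains in a group of its own, which mirrors how Cyanophyta and P. exigua are described as standing apart.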

Species distribution models (SDMs) with ML

All the above-described situations are captured by the distribution maps. In general, planktonic organisms show different distributions according to their trophic niche (Chase and Leibold 2009). In our case, we analysed a “hypervolume with n-dimensions” where environmental descriptors determine the presence or absence of an organism in a particular habitat (Hutchinson 1957). In this way, it is possible to quantitatively frame the trophic niches occupied through computational analyses (Elith et al. 2006; Drew et al. 2010). This has been previously done for some common copepod species (i.e. Calanoides acutus, Ctenocalanus citer, Metridia gerlachei, Oithona similis, Paraeuchaeta exigua and Paralabidocera antarctica) by Grillo et al. (2022), but in this paper we bring the analysis one step further by also analysing the phytoplankton assemblages (Chlorophyta, Cryptophyta, Dinophyceae, Prymnesiophyceae, Bacillariophyceae and Cyanophyta). Furthermore, the sampling was undertaken during a more recent expedition (2016–2017), while in Grillo et al. (2022) the data were from the austral summers 1987–1988, 1989–1990 and 1994–1995.
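The hypervolume idea can be made concrete with a deliberately simplified sketch: a rectangular envelope in which a site is considered suitable only if every environmental descriptor falls within a tolerated range. This is not the machine learning approach used in the study, which fits tree-based models to presence and absence data; the descriptor names and ranges below are hypothetical.

```python
def in_niche(site, niche):
    """True if every descriptor of the site lies within the niche range."""
    return all(low <= site[name] <= high for name, (low, high) in niche.items())


# Hypothetical tolerance ranges for an illustrative taxon (not from the study)
niche = {
    "temperature_c": (-1.8, 1.0),
    "salinity_psu": (33.5, 34.8),
    "depth_m": (0.0, 200.0),
}

# Hypothetical sites described by the same descriptors
site_inside = {"temperature_c": -0.5, "salinity_psu": 34.2, "depth_m": 50.0}
site_outside = {"temperature_c": 2.5, "salinity_psu": 34.2, "depth_m": 50.0}

print(in_niche(site_inside, niche))
print(in_niche(site_outside, niche))
```

Tree-based models generalise this picture: instead of one box with fixed limits, they learn many splits along the descriptor axes and return a graded index of occurrence rather than a yes or no answer.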

Our results indicate that also in the case of phytoplankton groups, it is possible to find distinct distributions that are picked up by the SDMs. Bacillariophyceae (diatoms) are found widely throughout the Ross Sea (Fig. 5), with significantly high RIO values in inshore areas and areas of maximum glacial cover. The Prymnesiophyceae are also found throughout the Ross Sea, however with more pronounced RIO values in the offshore areas. These observations confirm the ecology of the two phytoplanktonic assemblages, the documentation and description of which are already well-established (Nuccio et al. 2000).

When comparing the distribution of M. gerlachei, one of the most abundant calanoid copepods in the pelagic ecosystems of the Ross Sea (Voronina et al. 2001), with the modelled distribution of the same copepod in Grillo et al. (2022), variations in the predicted distribution emerge. During the 1980s and 1990s, M. gerlachei showed a widespread presence at all coastal stations sampled. As shown in Figures S54–S57 in Supplementary File S1 in Grillo et al. (2022), M. gerlachei was widespread in the Ross Sea area, in marine protected areas and along the entire water column analysed (0–750 m). This species showed high levels of occurrence in the neritic provinces, coastal habitats, sub-Antarctic areas and the Pennell Bank area, with low values recorded in the marine area opposite Victoria Land. In 2016–2017, the distribution of this species remained about the same, but its occurrence values, including those within the RSRMPA, decreased by around 20%. This underlines the importance of establishing baselines for MPAs against which future situations can be compared.

Effectiveness of the models

Among the models analysed, the RF algorithm produced a very high range of RIO values for both phytoplankton assemblages and copepod community (Fig. 7). Specifically, for copepods, the TN model generated more extreme RIO values, both higher and lower. This confirms that the Random Forest model produces maps that are closer to reality (Mi et al. 2017) compared to the other algorithms used in this study. The RIO values obtained from the CA, in addition to being very low in general, do not seem to accurately represent the actual distribution of the studied species. Ensemble models seem to buffer those issues best (e.g., Hardy et al. 2011).
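The buffering effect of an ensemble can be illustrated with a minimal sketch that averages the RIO predicted by each model for the same grid cell. The unweighted mean is an assumption made here for illustration, since the combination rule is not described in this section, and the RIO values are hypothetical.

```python
def ensemble_mean(predictions):
    """Average per-cell RIO values across models (unweighted mean)."""
    models = list(predictions.values())
    if not models:
        raise ValueError("at least one model is required")
    n_cells = len(models[0])
    if any(len(m) != n_cells for m in models):
        raise ValueError("all models must predict the same grid cells")
    return [sum(m[i] for m in models) / len(models) for i in range(n_cells)]


# Hypothetical RIO predictions for three grid cells from three models
predictions = {
    "CA": [0.0, 0.1, 0.2],
    "RF": [0.6, 0.5, 0.4],
    "TN": [0.0, 0.6, 0.9],
}

ensemble = ensemble_mean(predictions)
print([round(v, 2) for v in ensemble])
```

Because each ensemble value lies between the lowest and highest single-model prediction for that cell, very low values from one algorithm and very high values from another are pulled towards the centre.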

Our results, based on ROC analysis and boxplots of RIO values, suggest that there are differences in the generalization performance of the various modelling techniques (Fig. 7). Generally, the ROC values associated with high RIO values depend on the community analysed. For phytoplankton assemblages, medium–high ROC values correspond to high RIO values, while for the copepod community, medium–low ROC values correspond to high RIO values. Ideally, additional lines of evidence beyond ROC-type metrics should be used to assess model predictions.
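For readers less familiar with ROC-based evaluation, the following sketch computes the area under the ROC curve as the probability that a randomly chosen presence site receives a higher predicted RIO than a randomly chosen absence site, with ties counted as one half. The labels and scores are hypothetical, and the study's own ROC values come from its modelling software rather than from this code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    presences = [s for label, s in zip(labels, scores) if label == 1]
    absences = [s for label, s in zip(labels, scores) if label == 0]
    if not presences or not absences:
        raise ValueError("both presences and absences are required")
    wins = 0.0
    for p in presences:
        for a in absences:
            if p > a:
                wins += 1.0      # presence ranked above absence
            elif p == a:
                wins += 0.5      # tie counts as half
    return wins / (len(presences) * len(absences))


# Hypothetical presence (1) / absence (0) records and predicted RIO values
labels = [1, 1, 0, 0, 1, 0]
scores = [0.6, 0.4, 0.3, 0.5, 0.7, 0.1]

auc = roc_auc(labels, scores)
print(round(auc, 3))
```

A value of 0.5 corresponds to a model that ranks presences and absences no better than chance, and 1.0 to perfect separation. Because the measure depends only on ranking, it says nothing about the absolute magnitude of the RIO values, which is one reason the two quantities need not move together.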

Future perspectives and improvements

The current analyses are constrained by a limited number of sampling points, which were then extrapolated to a larger spatial scale. The authors recognize the potential shortcomings of this design, and it is evident that further refinements are necessary. This could involve incorporating observations from additional years and, if feasible, expanding the sampling stations to enhance the robustness of the inferences drawn. Additionally, utilizing data from the Global Biodiversity Information Facility (GBIF, https://www.gbif.org/) and the Ocean Biogeographic Information System (OBIS, https://obis.org/) could significantly enrich the dataset, providing a more comprehensive understanding and enabling more accurate conclusions. Here we provide first baseline information for wider assessment and improvement.

Nevertheless, the findings remain highly relevant as they provide a first quantitative foundation for subsequent analyses of real-world planktonic communities. Machine Learning emerges as a powerful tool for analysing such datasets characterized by complex non-linear patterns and a high volume of data.

Through this methodology, it becomes feasible to quantitatively estimate the potential 3D distribution of planktonic organisms (Drew et al. 2010; Steiner et al. 2023; Huettmann 2024; Guzzi et al. 2024). Given their pivotal ecological role, supporting the entire marine food web and ecological processes within the Ross Sea, these organisms are of paramount importance (Turner 2004). They serve as the link between dissolved organic and inorganic substances and primary and secondary consumers.

Another notable discovery is that even within the boundaries of the Ross Sea Region Marine Protected Area (RSRMPA), hotspots for phytoplankton species are already present. This underscores the global significance of the Ross Sea and its RSRMPA in contributing to global processes.

Conclusion

The findings of this study emphasize the pivotal role that key copepod species and phytoplankton assemblages play within the Ross Sea and the Ross Sea Region Marine Protected Area (RSRMPA) in maintaining the health of Antarctic marine ecosystems. The demonstrated effectiveness of marine protected areas in preserving primary trophic levels and associated predator populations underscores the global ecological significance of the Antarctic region. This research highlights the need for science-based management practices and provides robust Machine Learning and Open Access based quantitative models that can guide policymakers and conservation biologists. To ensure the continued effectiveness of conservation efforts and to mitigate the impacts of climate change on polar regions, ongoing monitoring and such dedicated research are imperative.

Availability of data and materials

The data and materials used in this work are available at the following link https://zenodo.org/records/11175588.


Acknowledgements

This paper is part of M.G.’s PhD thesis and we are grateful to the field work crews for their data work; further, we appreciate the data delivery and awareness of all institutions involved, e.g. the Italian National Antarctic Program (PNRA) and specifically the project “Plankton biodiversity and functioning of the Ross Sea ecosystems in a changing Southern Ocean” (P-ROSE; PI Olga Mangoni), the UAF and Salford-Minitab (Dan Steinberg) for the software use. FH appreciates CAML, COML ARCOD IPY, OBIS and GBIF initiatives he was able to join and grow in, as funded by SLOAN and others. The work of, and discussions with, D. Ainley and A. Del Rey, are highly appreciated. Further, D. Steinberg (Salford), H. Berrios and E.J. Huettmann supported this work. This is EWHALE lab publication #353. This paper is also an Italian contribution to the CCAMLR CONSERVATION MEASURE 91-05 (2016) for the Ross Sea region Marine Protected Area, specifically, addressing the priorities of Annex 91-05/C. We thank the National Recovery and Resilience Plan (NRRP), Mission 4, Component 2, Investment 1.4—Call for tender No. 3138 of 16 December 2021, rectified by Decree No. 3175 of 18 December 2021, of the Italian Ministry of University and Research funded by the European Union—NextGenerationEU; Award Number: Project code CN_00000033, Concession Decree No. 1034 of 17 June 2022 adopted by the Italian Ministry of University and Research, CUP D33C22000960007, Project title “National Biodiversity Future Center—NBFC”. We thank anonymous reviewers for their valuable suggestions and comments that improved the initial manuscript version.

Funding

Sampling was performed within the Italian National Antarctic Program (PNRA), specifically the project “Plankton biodiversity and functioning of the Ross Sea ecosystems in a changing Southern Ocean” (P-ROSE; PI Olga Mangoni). The authors are grateful to the Italian National Antarctic Scientific Commission (CSNA) and the Italian National Antarctic Program (PNRA) for the endorsement of this initiative and to the EWHALE lab, Inst of Arctic Biology, Biology & Wildlife Department, for the financial support.

Author information

Contributions

Conceptualization, M.G., F.H., and S.S.; methodology, M.G. and F.H.; software, F.H.; formal analysis, M.G. and F.H.; resources, M.G.; data acquisition L.G., A.G.; data curation, M.G., T.D. and S.S.; writing—original draft preparation, M.G., F.H., and S.S.; writing—review and editing, M.G., S.S., D.T., L.G., A.G., and F.H.; funding acquisition, F.H. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Marco Grillo.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors approved the manuscript for publication in Ecological Processes.

Competing interests

No competing interest exists in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

13717_2024_532_MOESM1_ESM.zip

Supplementary Material 1. Appendix A. Absence and presence matrix with related environmental descriptors. Appendix B. Lattice grid of Ross Sea with environmental descriptors with rasters utilized. Appendix C. Correlation matrix values field data and figure. Appendix D. Details models. Appendix E. VIP analysis values. Appendix F. Figure correlation plot with RIO values. Appendix G. Predictive maps for each copepod species and for each phytoplankton assemblage analysed. Appendix H. Predictor layers for each copepod and phytoplankton assemblage with the RIO-prediction for TreeNet, RandomForest, CART Decision Tree and Ensemble models. Appendix I. ISO-compliant metadata.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Grillo, M., Schiaparelli, S., Durazzano, T. et al. Machine learning applied to species occurrence and interactions: the missing link in biodiversity assessment and modelling of Antarctic plankton distribution. Ecol Process 13, 56 (2024). https://doi.org/10.1186/s13717-024-00532-6

  • DOI: https://doi.org/10.1186/s13717-024-00532-6

Keywords