Abstracts 6th July 22

Wednesday 6th July 10.45-12.45

James Byrne British Antarctic Survey EDS infrastructure, Net zero, Resilience
Alejandro Coca Castro The Alan Turing Institute EDS infrastructure
Alejandro Coca Castro ASAS EDS infrastructure
Erick Chacon-Montalvan Lancaster University EDS infrastructure
Erick Chacon-Montalvan Lancaster University Extremes
Diarmuid Corr Lancaster University EDS infrastructure
Eoghan Darbyshire Conflict and Environment Observatory Extremes
Eleanor D’Arcy Lancaster University Extremes
Rachael Duncan Lancaster University Extremes
Rachel Furner University of Cambridge and BAS EDS infrastructure
Jake Grainger Lancaster University EDS infrastructure, Extremes
Sebastian Hickman University of Cambridge EDS infrastructure, Extremes
Sebastian Hickman University of Cambridge EDS infrastructure, Extremes
Michael Hollaway UK Centre for Ecology & Hydrology EDS infrastructure
Craig MacDonell University of Glasgow Resilience
Peter Manshausen University of Oxford EDS infrastructure, Extremes
David Moffat Plymouth Marine Laboratory EDS infrastructure
Conor Murphy Lancaster University Extremes
Tom Pinder Lancaster University EDS infrastructure
Maria Salama Lancaster University EDS infrastructure
Qingying Shu Lancaster University Land use
Mala Virdee (not presenting) University of Cambridge EDS infrastructure, Extremes

James Byrne (British Antarctic Survey)

IceNet: A deep learning framework for sea ice forecasting.

Global warming has caused vast amounts of Arctic sea ice to melt, with severe impacts for local people and ecosystems. Despite this, forecasting sea ice accurately remains a major unsolved challenge. Our team of researchers developed a sea ice forecasting AI system, ‘IceNet’, which pushed the boundaries of forecasting ability and speed [1]. We have since developed a software framework for operating IceNet in real time, for both poles, on a daily timescale. This framework is a component of the British Antarctic Survey’s operational and environmental Digital Twins.

This talk and poster will showcase our end-to-end deep learning sea ice forecasting pipeline, which facilitates both model training and the deployment of infrastructure for generating forecasts. We will demonstrate the key architectural components and their functionality, illustrating how the data pipeline can be integrated with downstream applications flexibly and with scaling in mind. In addition, we will describe how this framework sits within the wider Digital Twin ecosystem.

The operational framework of IceNet demonstrates the value of building on sustainable software design principles [2], illustrating best practices adaptable to other applications. Through the use of effective, sustainable and generalisable design principles, such methods can be used by researchers to develop operational data pipelines for many areas of environmental data science.

[1] [Seasonal Arctic sea ice forecasting with probabilistic deep learning, *Nature Communications*](https://www.nature.com/articles/s41467-021-25257-4)
[2] [James Byrne | SSI Fellow 2022](https://www.software.ac.uk/about/fellows/james-byrne)

Alejandro Coca Castro, The Alan Turing Institute

Environmental Data Science Book: a community-driven resource showcasing open-source and reproducible environmental science

With the plethora of open data and computational resources available, environmental data science research and applications have accelerated rapidly. There is therefore an opportunity to propose new cyberinfrastructure for compiling and classifying open-source research and applications across environmental systems (polar, oceans, forests, agriculture, etc.). Building upon the Pangeo Gallery, we propose the Environmental Data Science book (https://the-environmental-ds-book.netlify.app), a community-driven online resource showcasing and supporting the publication of data, research and open-source developments in environmental sciences. The target audience and early adopters are i) anyone interested in open-source tools for environmental science; and ii) anyone interested in reproducible, inclusive, shareable and collaborative AI and data science for environmental applications.

Following FAIR principles, the resource provides multiple features, such as guidelines, templates, persistent URLs and Binder, to facilitate fully documented, shareable and reproducible notebooks. The quality of the published content is ensured by a transparent reviewing process supported by GitHub-related technologies. To date, the community has successfully published seven Python-based notebooks: one agriculture-, two forest-, two wildfires/savanna-, one ocean- and one polar-related. The notebooks use open-source Python libraries (e.g. intake, iris, xarray and hvplot) for interactive visualisation and modelling of environmental sensor data. In addition to constant feature enhancements of the GitHub repository https://github.com/alan-turing-institute/environmental-ds-book, we expect to attract contributions in other programming languages (e.g. Julia and R) and to host further community activities (collaboration and co-working sessions) towards improving scientific software practices in the environmental science community.

Alejandro Coca Castro, ASAS

The generation of high-spatial-resolution soil moisture data has mainly relied on either high-resolution satellite observations or the downscaling of existing coarse-resolution satellite- and/or process-based soil moisture datasets. Here, we aim to advance the latter by evaluating the performance of convolutional neural process (ConvNP) models to probabilistically downscale soil moisture in the UK, using data from a wide range of modalities including off-the-grid and gridded spatio-temporal datasets. We train the models using data from the COSMOS-UK sensor network (~70k observations) and climate variables (including soil moisture) from the ERA5 and ERA5-Land reanalysis gridded datasets, at ∼25 km and ∼9 km spacing across the globe respectively. The target resolution is guided by the 1 km elevation layer from the Global Multi-resolution Terrain Elevation Data 2010 (GMTED2010). We compare the outputs with ERA5-Land soil moisture values and with predictions from a trained linear least squares regression. While we expect to include other probabilistic models, the comparison against naïve approaches has allowed us to map performance gains and to investigate the optimal hyperparameters and feature configuration of the ConvNPs. As a result, the best-calibrated model outperformed its naïve counterparts in almost 70% of cases, according to the reported error values. While the preliminary results are promising, further work is being conducted to generate more spatially coherent and consistent downscaled predictions. In this regard, we will improve model performance by injecting further contextual information driving soil moisture, including slope, land cover and use, and soil porosity, among others.


Erick Chacon-Montalvan (Lancaster University)

Climate change can drastically impact the habitat suitability of a range of plant species, leading to major impacts on agriculture and, consequently, on economic activities. As a way to cope with possible changes, there must be efforts to predict land suitability under future scenarios in order to optimize land allocation. Land suitability is defined mainly by the intrinsic characteristics (e.g. physical, chemical, climatic) of land and, currently, can differ from land cover (LC) because human intervention has not optimized land use. Modeling land suitability is not straightforward because it is not directly observed; land cover (a proxy for land suitability) includes biases due to human intervention; and the definition of a spatial stochastic process that adequately represents land suitability is not obvious. In this paper, we use a Bayesian hierarchical spatial model to capture the variation in land cover that can be related to land suitability, and predict it under future climate change scenarios. Statistically, we define land suitability (LS) as a continuous multivariate latent process over a set of complementary land types (e.g. arable, wetland, grassland, forest), and use a multivariate spatial process to represent it, taking into account the spatial correlation between categories. Our hierarchical model first expresses land cover in terms of land suitability and of land cover conditioned on land suitability, π(LC(s)) = ∫ π(LS(s)) π(LC(s) | LS(s)) dLS(s). Then, we define a model for land suitability depending only on the intrinsic characteristics of land. Finally, the conditional distribution of land cover given land suitability is modeled to isolate the human intervention that defines the current urban areas. With this approach, we can predict land suitability even in current urban areas, and predict all categories over 20-year periods.

Erick Chacon-Montalvan (Lancaster University)

In this paper, we introduce two new model-based versions of the widely used standardized precipitation index (SPI) for detecting and quantifying the magnitude of extreme hydro-climatic events. Our analytical approach is based on generalized additive models for location, scale and shape (GAMLSS), which helps us to overcome some limitations of the SPI. We compare our model-based standardised indices (MBSIs) with the SPI using precipitation data collected between January 2004 and December 2013 (522 weeks) in Caapiranga, a road-less municipality of Amazonas State. We show that the MBSI-1 is an index with similar properties to the SPI, but with improved methodology. In comparison to the SPI, our MBSI-1 index allows the use of different zero-augmented distributions, works with more flexible time-scales, can be applied to shorter records of data and also takes into account temporal dependencies in known seasonal behaviours. Our approach is implemented in an R package, mbsi, available from GitHub.
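To make the standardisation idea behind such indices concrete, the sketch below computes a rank-based standardised index by mapping empirical non-exceedance probabilities through the inverse standard-normal CDF. This is a toy analogue only: the MBSI described above fits zero-augmented distributions via GAMLSS, which this simplification does not attempt, and the function name is illustrative.

```python
from statistics import NormalDist

def empirical_spi(precip):
    """Rank-based standardised index: convert each value's empirical
    non-exceedance probability into a standard-normal quantile.
    (Toy analogue of the SPI; a model-based index instead replaces the
    empirical CDF with a fitted zero-augmented distribution.)"""
    n = len(precip)
    order = sorted(range(n), key=lambda i: precip[i])
    nd = NormalDist()
    index = [0.0] * n
    for rank, i in enumerate(order, start=1):
        p = (rank - 0.5) / n       # Hazen plotting position, avoids p = 0 or 1
        index[i] = nd.inv_cdf(p)   # z-score: negative means drier than usual
    return index
```

Read as for the SPI, values near −2 would flag unusually dry periods and values near +2 unusually wet ones.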

Diarmuid Corr (Lancaster University)

Automated mapping of ice sheet supraglacial hydrology using Machine Learning.

Around the periphery of the Greenland and Antarctic Ice Sheets, networks of supraglacial lakes and streams form each summer in response to seasonal surface melting. The nature, extent and dynamics of these surface hydrological systems are important because they affect the transport of freshwater towards the coast, and can impact factors such as ice dynamics. With the launch of operational missions carrying optical sensors, such as Sentinel-2, there is the opportunity to monitor this system at weekly periodicity around the entirety of the ice sheet margins each summer. Given that there are many thousands of these features (~76,000 identified across Antarctica in January 2017, for example), and that they appear in many thousands of satellite images, accurate, automated approaches to mapping them are urgently needed. However, conventional mapping approaches require extensive manual post-processing to remove false positives, which makes them infeasible within the context of a Digital Twin. Here, we consider more automated approaches, investigated within the 4D Greenland and 4D Antarctica studies, that are better suited for implementation within a future Digital Twin. Specifically, we evaluate the potential of Machine Learning approaches, including a Random Forest algorithm, trained to separate surface water from non-water features in a pixel-based classification. We assess their performance relative to conventional approaches, including their spatial and temporal transferability, and investigate performance across the margins of both the Greenland and Antarctic Ice Sheets. Our approach, designed for easy, efficient rollout over multiple melt-seasons, uses optical satellite imagery alone.
The workflow, developed on Lancaster University’s High End Computing facility and Google Cloud Platform, which hosts the entire archive of Sentinel-2 and Landsat-8 data, allows for large-scale application over the Greenland and Antarctic Ice Sheets and is intended for repeated use throughout future melt-seasons. Ice sheets, a crucial component of the Earth system, impact global sea level, ocean circulation and biogeochemical processes. This study shows one example of how Machine Learning can automate historically user-intensive satellite processing pipelines within a Digital Twin, allowing for greater understanding and data-driven discovery of ice sheet processes.

Eoghan Darbyshire (The Conflict and Environment Observatory), Henrike Schulte (Zoological Society of London), Linsey Cottrell (The Conflict and Environment Observatory), Philipp Barthelme (University of Edinburgh)

Monitoring the environmental dimensions of conflicts

War can intensify pre-existing environmental issues and present new and novel problems. These range from discrete incidents requiring rapid assessment (e.g. marine oil slicks in Libya) to longer-term environmental change (e.g. loss of primary tropical forest in Myanmar). Yet there is a dearth of robust environmental data in locations currently and recently affected by conflicts. This follows expertise, capacity and will being lost or diminished in academia, civil society, commerce and, in particular, the government departments responsible for environmental administration. At the UK charity the Conflict and Environment Observatory (CEOBS), one of our aims is to help close this environmental data gap through our own research, and through advocating for more environmental study of conflict settings by international organisations, academia, and in-country civil society groups. Our poster and presentation will exhibit some of our independent and open-source research, highlighting the data science approaches we use and the challenges we face across different time-scales, geographies and disciplines:
An overview of our database of environmental pollution incidents in Ukraine since the Russian invasion. We are scraping social media to isolate environmentally relevant incidents, then verifying, documenting, archiving and assessing environmental risk via crowd-sourced open-source intelligence.
Woody vegetation loss in Tigray, threatening the nature-based landscape restoration which has improved food security and biodiversity, via a classification and time-series analysis of optical and radar satellite data.
Approaches to understanding the legacy of aerial bombing for tropical forests in Vietnam, including orthorectifying declassified spy satellite imagery and applying machine learning methods for crater detection. This is work from PhD student, Philipp Barthelme, for whom CEOBS are case partners.
Closing the ‘military emissions gap’ via novel monitoring methods and new reporting frameworks. The sector may emit 5% of the global total, but reporting exemptions and a lack of transparency mire this estimate in uncertainty. This includes quantifying the carbon cost of conflicts too, and helping advocate for sustainable reconstruction – only peaceful societies can be net zero societies.

Eleanor D’Arcy (Lancaster University)

Coastal flooding poses an increasing risk to coastline communities due to anthropogenic climate change. Extreme sea level estimation requires statistical analysis based on extreme value theory to extrapolate to unobserved levels. We develop a model for sea levels from which we can estimate return levels: the value exceeded with some probability p in a year. We are particularly interested in rare events, where p is small, to assist with coastal flood defence design. Early methods modelled sea levels directly, but this ignores the known tidal component, and results were biased because assumptions of stationarity were violated. Instead, we consider peak tide and skew surge as the only components of sea levels. Skew surges are stochastic, defined as the difference between the maximum observed sea level and the peak tide within a tidal cycle. They are driven meteorologically, so are typically worse in winter and less extreme in summer. We model extreme skew surges using a generalised Pareto distribution and capture non-stationarity, as well as the dependence on peak tides, by adding covariates to our model. Since peak tides are predictable, we carefully choose our tidal samples to reflect monthly and interannual variations. We show that the estimates currently used in practice can be lower than those estimated using our methodology; underestimation of sea level return levels can lead to coastal defences being breached, with devastating consequences.
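For readers unfamiliar with the extrapolation step, the sketch below implements the textbook generalised Pareto return-level formula (the standard form only, without the skew-surge covariates or tidal dependence the abstract describes; the function name and inputs are illustrative).

```python
import math

def gpd_return_level(u, sigma, xi, exceed_rate, return_period):
    """N-year return level when exceedances of threshold u follow a
    generalised Pareto distribution with scale sigma and shape xi.
    exceed_rate: mean number of threshold exceedances per year.
    return_period: N, in years."""
    m = exceed_rate * return_period     # expected exceedances over the period
    if abs(xi) < 1e-12:                 # exponential limit as xi -> 0
        return u + sigma * math.log(m)
    return u + (sigma / xi) * (m ** xi - 1.0)
```

Longer return periods give higher levels; this extrapolation beyond the observed record is what informs the design height of coastal defences.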

Rachael Duncan (Lancaster University)

The Kalman filter is a useful tool for analysing and forecasting time series data and has remained popular across various disciplines. One key assumption of the Kalman filter is normality. The multivariate normal distribution is completely characterised by its mean and variance, which allows the Kalman filtering steps to be performed quickly and efficiently. However, this assumption does not hold in all applications; for example, there may be skew present in the observations. In this work, we propose a skew normal Kalman filter, based on the multivariate skew normal distribution. Incorporating skewness into the traditional Kalman filter aims to improve its applicability by extending its usefulness to a wider range of data distributions without compromising its low computational cost. The skew normal Kalman filter is an extension of the typical Kalman filter that accounts for the skewness present in the observational data, to better characterise the data and produce more accurate forecasts. Here we use air quality as a motivating example to demonstrate our approach.
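To make the baseline being extended concrete, here is the scalar Gaussian Kalman filter in its standard predict/update form (a minimal sketch with illustrative parameter names, not the authors' skew normal implementation).

```python
def kalman_filter_1d(observations, phi, q, r, m0=0.0, p0=1.0):
    """Standard (Gaussian) Kalman filter for the scalar model
        x_t = phi * x_{t-1} + w_t,  w_t ~ N(0, q)
        y_t = x_t + v_t,            v_t ~ N(0, r)
    Returns the filtered state means. The skew normal filter replaces
    the Gaussian assumption while keeping this recursive structure."""
    m, p = m0, p0
    means = []
    for y in observations:
        # predict: propagate mean and variance through the state model
        m_pred = phi * m
        p_pred = phi * p * phi + q
        # update: weight prediction against the new observation
        k = p_pred / (p_pred + r)      # Kalman gain
        m = m_pred + k * (y - m_pred)
        p = (1 - k) * p_pred
        means.append(m)
    return means
```

Because each step only updates a mean and a variance, the recursion is cheap; the skew normal extension aims to keep that property while dropping the symmetry assumption.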

Rachel Furner (University of Cambridge, British Antarctic Survey), Peter Haynes (University of Cambridge), Dan Jones (British Antarctic Survey), Dave Munday (British Antarctic Survey), Brooks Paige (University College London), Emily Shuckburgh (University of Cambridge)

Developing an emulator of an Ocean GCM

Process-based weather and climate general circulation models (GCMs) currently represent the best tools we have to predict and understand weather and climate evolution. However, these models require huge amounts of computing resources. Machine learning offers opportunities to improve the computational efficiency of these models through data-driven emulators. I discuss recent work to develop a data-driven emulator of an ocean GCM. Progress has been made in developing data-driven forecast systems for atmospheric weather; however, despite being a key component of the weather and climate system, the ocean has received less focus in applications of these techniques. While there are many similarities between the dynamics of the ocean and the atmosphere, key differences exist. In particular, we focus on the inclusion of land, and the consequent boundary dynamics, within our model domain. This brings additional challenges not yet explored in existing applications. We train a convolutional neural network on the output from a GCM of an idealised channel configuration of oceanic flow. We show that, similarly to existing atmospheric applications, the model is able to learn the complex dynamics of the system well when trained only on interior ocean points – replicating the mean flow and details within the flow over single prediction steps, and iterating well over ‘weather’ scales (2-3 weeks). However, when we include land in our domain, our CNN struggles to capture the dynamics of the system and further development is needed.

Jake Grainger (Lancaster University)

Parametric estimation of the frequency-direction spectrum for ocean waves

Understanding the behaviour of wind-generated ocean waves is important for many offshore and coastal engineering activities. The frequency-direction spectrum is important for characterising such waves, and plays a central role in understanding the impact of ocean waves on structures and vessels. Estimating the frequency-direction spectrum is challenging, as the physical process in question is spatio-temporal and continuous, but we usually only observe the sampled 3D displacement of a buoy floating on the surface (a multivariate time series). Existing parametric techniques for recovering the frequency-direction spectrum are good at estimating location parameters (e.g. the peak frequency of waves), but struggle to recover more intricate shape parameters (e.g. the spread of energy around the peak frequency). We demonstrate how, by transforming the model of interest into a model for the recorded series, we can use a multivariate pseudo-likelihood approach to recover the parameters of the model. Our novel method is statistically more powerful and resolves more parameters than the current state-of-the-art, thus providing a better characterisation of the ocean. We demonstrate our methodology on data recorded in the North Sea, focussing on storms and extreme events, which are of high significance to engineers and environmental scientists.


Sebastian Hickman (University of Cambridge)

Can simple machine learning methods predict concentrations of OH better than state of the art chemical mechanisms?

Concentrations of the hydroxyl radical, OH, control the lifetime of methane, carbon monoxide and other atmospheric constituents. The short lifetime of OH, coupled with the spatial and temporal variability in its sources and sinks, makes accurate simulation of its concentration particularly challenging. To date, machine learning (ML) methods have been infrequently applied to global studies of atmospheric chemistry.

We present an assessment of the use of ML methods for the challenging case of simulation of the hydroxyl radical at the global scale, and show that several approaches are indeed viable. We use observational data from the recent NASA Atmospheric Tomography Mission to show that machine learning methods are comparable in skill to state of the art forward chemical models and are capable, if appropriately applied, of simulating OH to within observational uncertainty.

We show that a simple ridge regression model is a better predictor of OH concentrations in the remote atmosphere than a state of the art chemical mechanism implemented in a forward box model. Our work shows that machine learning may be an accurate emulator of chemical concentrations in atmospheric chemistry, which would allow a significant speed up in climate model runtime due to the speed and efficiency of simple machine learning methods. Furthermore, we show that relatively few predictors are required to simulate OH concentrations, suggesting that the variability in OH can be quantitatively accounted for by few observables with the potential to simplify the numerical simulation of atmospheric levels of key species such as methane.
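As a pocket illustration of why ridge regression is so cheap relative to a forward chemical mechanism, the one-predictor case has a closed form requiring no iterative optimisation (a hypothetical single-feature sketch; the study's model uses several observed predictors of OH, not one).

```python
def ridge_1d(x, y, lam):
    """Closed-form ridge estimate for a single centred predictor:
    beta = sum(x*y) / (sum(x^2) + lam). Setting lam = 0 recovers
    ordinary least squares; larger lam shrinks beta towards zero,
    trading a little bias for stability."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)
```

The multi-predictor case is a single linear solve, which is why replacing a box-model mechanism with such an emulator can yield large speed-ups.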

Sebastian Hickman (University of Cambridge)

Predicting ozone air pollution in Europe with temporal deep learning

Surface ozone is a significant pollutant worldwide that contributes to approximately 300,000 premature deaths annually, and has a considerable effect on crop yields, with an estimated cost of billions of dollars per year. In light of recent stricter WHO regulations on surface ozone levels, more accurate predictions of ozone air pollution may facilitate improved preventative policy to reduce the risk to humans.

We use observational station data from countries across Europe, retrieved from the TOAR database, to train an uncertainty-aware temporal machine learning model to make predictions of daily maximum 8-hour mean ozone concentration at those stations. Our model, based on the Temporal Fusion Transformer architecture, is able to make skilful reconstructions of daily ozone concentrations using concurrently observed meteorological covariates such as temperature and humidity (MAE = 2.1 µg/m³, R2 = 0.90), and is also able to make skilful short-term future forecasts of ozone levels at lead times of up to 4 days (MAE = 2.6 µg/m³, R2 = 0.81). The model outperforms traditional machine learning methods, such as ridge regression and random forests, and traditional time series forecasting methods such as ARIMA.

Furthermore, by investigating the attention mechanism in the model, we are able to analyse which variables and patterns in the data most inform the model’s predictions. We then compare these findings to known physical mechanisms to determine if the model captures the underlying physical relationships driving ozone concentrations.

Finally, we analyse the performance of our model when predicting extreme high ozone events, which typically occur in the summertime, and when predicting at rural and urban stations. We find that our model performs less skilfully when predicting extreme ozone events, which suggests that further work is required to determine how best to model extrema with transformer-based architectures.


Michael Hollaway (UK Centre for Ecology & Hydrology)

Detecting Spatio-temporal changepoints in environmental datasets.

Changepoint detection techniques are typically used in the environmental sciences to detect local-scale events in time series that could be indicative of significant changes in the underlying environmental system. When presented with station data, changepoints are typically detected using a marginal approach, in which each station time series is treated in isolation. Here, we present a new approach to changepoint detection whereby the spatio-temporal relationship between stations is factored into the changepoint detection algorithm. In this case, generalised additive models (GAMs) are utilised to model this spatio-temporal relationship. The GAMs are used in conjunction with the pruned exact linear time (PELT) changepoint detection method to detect common changepoints in the time series across all locations. An application of this method is demonstrated using UK air quality measurements during March 2020, detecting changes in nitrogen dioxide (NO2) concentrations associated with the nationwide COVID lockdown.
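The cost-minimisation idea underlying PELT can be seen in the single-changepoint case below (an illustrative sketch only; PELT extends this to multiple changepoints with a penalty term and pruning for linear-time search, and the abstract further couples it with GAMs across stations).

```python
def best_mean_changepoint(series):
    """Find the split that minimises the total squared error of a
    two-segment constant-mean fit, and return the changepoint index.
    A full method adds a penalty per changepoint so that splits are
    only accepted when they reduce the cost by enough."""
    def sse(seg):
        mu = sum(seg) / len(seg)
        return sum((v - mu) ** 2 for v in seg)

    best_tau, best_cost = None, float("inf")
    for tau in range(1, len(series)):
        cost = sse(series[:tau]) + sse(series[tau:])
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau
```

On an NO2-like series with a sudden drop in level, the minimum-cost split lands at the drop, which is the behaviour the marginal approach applies station by station.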


Craig MacDonell (University of Glasgow)

Despite accelerated rates of coastal erosion and growing coastal populations, global understanding of the relative resilience of communities to coastal erosion is limited, yet social justice and climate justice are key emerging issues of concern for governments. For the first time in the UK, using Scotland as an exemplar, this work aims to couple anticipated erosion risk with consideration of the social vulnerability of Scotland’s coastal communities, to produce Coastal Erosion Disadvantage maps. A combination of Dynamic Coast erosion data, the latest Census data from 2011, the latest data from the Scottish Index of Multiple Deprivation (2016 & 2020), and academic and policy literature concerning coastal erosion and flooding vulnerability were used to create a Social Vulnerability Classification Index (SVCI) using a series of deprivation and context-specific indicators. We report that coastal communities have a slightly higher proportion of more socially vulnerable groups compared with the Scottish average, with spatial variations in Coastal Erosion Disadvantage (e.g. East Lothian, South Ayrshire and Argyll & Bute have higher vulnerability). The maps show that under an IPCC High Emissions Scenario (HES, RCP8.5), and assuming no future maintenance of coastal defences, 37% of the residential property anticipated to be affected by coastal erosion is within the top three SVCI vulnerability categories. In addition, 67% of socially vulnerable properties anticipated to be at coastal erosion risk by 2050 are currently undefended. We recommend that this initial assessment is used by planners as a catalyst for further in-depth, place-based assessments of social vulnerability to erosion for current and future planned developments in at-risk communities, to help society become sea level wise.


Peter Manshausen (University of Oxford)

Time Series Causality for Aerosol Cloud Invigoration

Aerosol has been argued (Williams et al. 2002, Rosenfeld et al. 2008) to ‘invigorate’ convective clouds, such as in very energetic thunderstorms. In polluted regions, aerosol would decrease cloud droplet radii and therefore delay precipitation. Droplets are transported to higher altitudes than they would be in clean conditions. Here, they freeze and release latent energy. The ice particles fall and melt again in lower regions. This increases the heat transport in the cloud, which in turn means more rainfall for the same amount of convective available potential energy (CAPE). This is the proposed mechanism for convective invigoration. Li et al. (2011) claim to have found evidence of such invigoration in the datasets of the ARM Southern Great Plains site. They show that in mixed-phase clouds, cloud-top height and thickness increase with aerosol concentration. They also show that when there is more aerosol, rainfall increases in the case of high liquid water content clouds. According to the authors, these observations are evidence for convective invigoration. Conversely, Varble (2018) argues that while there is a correlation between aerosol loading and cloud-top height, there is no causal link between the two. He shows that meteorological variables, especially the level of neutral buoyancy (LNB) and CAPE, are correlated with both aerosol and cloud-top height, and that the addition of aerosol as a predictor does not add to a regression model for cloud-top height. He proposes that rain could be at the origin of the observed correlations, being impacted by meteorological variables and washing out aerosol. To untangle the causal links, here we use the methodology of time series causality (Runge et al., 2019), in particular the PCMCI algorithm, to elucidate the links between aerosol, precipitation, cloud height, and meteorology.
Time series of a wide array of data (Active Remotely-Sensed Cloud Location product, MERGESONDE, condensation nuclei, radiosonde, and Arkansas–Red Basin River Forecast Center hourly rainfall data, among others) are used and fed into the PCMCI algorithm in order to construct a directed acyclic graph representing the causal links inferred from this time series data. By applying causal inference techniques to real world data, we present new perspectives for both the aerosol and the causal inference communities.

David Moffat (Plymouth Marine Laboratory), Katie Awty-Carroll (Plymouth Marine Laboratory), and Daniel Clewley (Plymouth Marine Laboratory)

Supporting Environmental Research with Artificial Intelligence at the NERC Earth Observation Data Acquisition and Analysis Service (NEODAAS)

The application of AI and machine learning is growing rapidly across environmental research. State-of-the-art machine learning techniques can be used to analyse and exploit environmental data, producing greater insight into the data captured and enabling better understanding of the environment. To effectively exploit the benefits of machine learning, NEODAAS has brought together the key resources required to enable environmental research excellence. NEODAAS provides AI expertise, support for the development of AI/ML pipelines, and access to specialised hardware for the environmental research community. Through development services, we work with individuals and external collaborators to create bespoke AI solutions for environmental research problems. Throughout model development and deployment, we provide expert advice and knowledge to support users in identifying the specific challenges of their science area, how best to approach their research task, and which technologies would be most appropriate. We also work with users to identify areas where existing AI models and pipelines can be improved, supporting best practices for model accuracy, applicability, and efficiency. This could be through advice, practical support, and implementation or optimisation of pre-existing code. We also run regular training courses focusing on practical applications of machine learning to environmental datasets. We have supported a variety of academic research with environmental applications, including algal bloom segmentation, tree monitoring from remote sensing and terrestrial LiDAR, glacier front detection, vehicle pollution tracking and underwater image analysis.

Conor Murphy (1), Jonathan Tawn (1), Peter Atkinson (1), Stijn Bierman (2), Ross Towe (2), Zak Varty (3). (1) Lancaster University, (2) Shell, (3) Imperial College London.

Spatio-temporal threshold selection for extreme induced seismicity

Production of oil and gas can, in certain circumstances, cause shallow-depth, low-magnitude earthquakes (induced seismicity). The potential impact of high levels of induced seismicity justifies careful modelling of magnitudes to forecast hazards under future extraction scenarios. Anthropogenic earthquakes come with significant modelling challenges due to the complexity of the underlying processes and the difficulties in obtaining reliable measurements. One of these challenges is variable data quality, in the form of partially censored data. This is due to the development of the geophone recording network over time. Missing observations occur in periods where networks were too sparse and insensitive to accurately detect these low-magnitude events. In order to statistically model induced earthquake magnitudes, we use principles from extreme value theory. This provides a framework to model observations above a suitably high threshold using a generalised Pareto distribution. The choice of threshold is a major modelling challenge: too low a threshold results in a biased model fit, whilst too high a threshold leads to high parameter uncertainty. Previous work has explored the selection of a time-varying threshold above which the earthquake catalogue may be considered complete. This allows smaller magnitude events, unused in other analyses, to contribute to the understanding of extreme events and incorporates changing data quality into the subsequent extreme value analysis. The density of the sensor network also varies spatially. We develop a methodology to allow spatial variability in the chosen threshold while still accounting for varying data quality over time. We compare our automated threshold selection method to existing methods.
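The peaks-over-threshold step described above can be sketched with SciPy. The catalogue, threshold, and generalised Pareto parameters below are synthetic placeholders chosen for illustration, not values from the authors' analysis:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic catalogue: magnitudes below the threshold u are incompletely
# recorded; exceedances of u are drawn from a GPD, as extreme value theory
# predicts for a suitably high threshold.
u = 1.5                                              # hypothetical threshold
body = rng.uniform(0.0, u, size=5000)                # sub-threshold events
excess_true = stats.genpareto.ppf(rng.uniform(size=800), c=0.1, scale=0.4)
magnitudes = np.concatenate([body, u + excess_true])

# Peaks-over-threshold fit: model excesses above u with a GPD,
# fixing the location parameter at zero.
excesses = magnitudes[magnitudes > u] - u
shape, loc, scale = stats.genpareto.fit(excesses, floc=0.0)
```

With 800 exceedances the fitted shape and scale should land close to the generating values (0.1 and 0.4); in practice the threshold `u` itself is the quantity being selected, and the paper's contribution is letting it vary over space and time.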


Tom Pinder (Lancaster University)

Accurately predicting air quality levels at a fine resolution is critical for ensuring that the public’s health is not put at risk. In this work, we construct a graph representation of a road network and model nitrogen dioxide levels using a Gaussian process defined on the vertices of the graph. We introduce a heteroscedastic noise process into our model to capture the complex variations that exist in nitrogen dioxide observations. Defining our model in this way offers superior predictive performance to its spatial analogue. Further, a graph representation allows us to infer the air pollution exposure that an individual would experience on a specific journey, and their subsequent risk. We demonstrate this approach for the district of Mitcham, London.
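A toy sketch of Gaussian process regression on the vertices of a graph, using a diffusion kernel built from the graph Laplacian (one common choice of vertex kernel; the paper's kernel may differ, and this sketch uses homoscedastic rather than heteroscedastic noise). The network, readings, and kernel length-scale are all invented:

```python
import numpy as np
from scipy.linalg import expm

# Toy "road network": five junctions along a line, adjacency matrix A.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A           # combinatorial graph Laplacian

# Diffusion kernel on the vertices: covariance decays with graph distance.
K = expm(-0.5 * L)

# GP regression: NO2 observed at the two end junctions, predicted everywhere.
obs_idx = np.array([0, 4])
y = np.array([40.0, 25.0])               # hypothetical NO2 readings (ug/m3)
K_oo = K[np.ix_(obs_idx, obs_idx)] + 1e-2 * np.eye(2)   # homoscedastic noise
posterior_mean = K[:, obs_idx] @ np.linalg.solve(K_oo, y)
```

The posterior mean interpolates along the graph: it nearly reproduces the readings at the observed junctions and shrinks towards the prior at junctions far (in graph distance) from any sensor, which is what makes journey-level exposure estimates possible.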

Maria Salama (Lancaster University)

Research study on the experiences, opportunities, and challenges of Virtual Labs for Environmental Data Science

Virtual Labs provide a collaborative, dynamic, and tailorable platform for research in Environmental Data Science. Within the Data Science of the Natural Environment (DSNE) project, we are conducting a research study on the current experiences, barriers, and opportunities associated with Virtual Labs, as well as requirements for future developments. We conducted an online survey distributed to the wider DSNE community, complemented by semi-structured interviews for a more in-depth understanding of the survey findings. With a balanced distribution of participants with and without experience of Virtual Labs, the results indicate the different uses of Virtual Labs and how participants rate their facilities, services, and overall experience. The results have also given us useful insights into the challenges and requirements for future extension and development. Our future work will be directed towards studying the collaboration aspects of Virtual Labs.


Qingying Shu and Ce Zhang (Lancaster University)

Deep hierarchical classification for recognising fine-grained crop types using time-series satellite radar images

Crop identification and mapping using satellite remote sensing techniques is critical for agricultural monitoring and management. Distinguishing crops from satellite sensor imagery can be challenging given the irregular shape of fields, the complex mixture of crops within smallholder farms, the variety of crops, and the frequent land use changes. Advances in satellite sensor techniques and classification algorithms allow us to acquire timely information on crop types at fine spatial scales. State-of-the-art research on crop classification involves the joint use of deep learning techniques and clustering methods.
Our current research aims to develop a recurrent neural network (RNN) for crop classification using Sentinel-1A time-series backscatter images. The objectives of our study are to discriminate a wide variety of crops at fine spatial detail and to increase classification accuracy using time-series images. To achieve higher accuracy, we set up a hierarchical structure that first selects arable land, then identifies the crop categories within the arable land, and finally recognises the crop classes within each category. A pilot study was performed on an area of north-western Germany, for which we obtained the land use registry across the 2018 growing seasons as ground reference data. The crop type labels are provided in a three-level hierarchy: the first level specifies the land use types, such as arable land; the second level includes categories within the land use types. The majority of the study region comprises arable land of 10 categories, namely oilseeds, commercial crops, fallow land, protein crops, energy crops, vegetables, cereals, rooted fruits, herbs and spices, and ornamental plants. The third level details the crop classes within each category; 39 crop classes exist across these categories. The major crop classes identified include barley, rapeseed, rye, wheat, and potatoes. We extract time-series pixel values for the crop classes and categories, and then split them into training and testing data sets. Our modelling approaches are based on Long Short-Term Memory (LSTM) deep learning models, which transform the temporal and dual-polarisation input features into sequential hidden states, generate output scores, and then predict the crop types. We then compare the results from the LSTM and the hierarchical approach.
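The coarse-to-fine decision described above can be sketched as follows. The label tree and scores are illustrative stand-ins (only a few of the 10 categories and 39 classes are shown), and in the real system the level-wise scores would come from the LSTM rather than being supplied directly:

```python
# Hypothetical three-level label tree: land use -> crop category -> crop class.
HIERARCHY = {
    "arable land": {
        "cereals": ["barley", "rye", "wheat"],
        "oilseeds": ["rapeseed"],
        "rooted fruits": ["potatoes"],
    },
    "grassland": {},
}

def classify(level1_scores, level2_scores, level3_scores):
    """Pick the best label at each level, restricted to the children of
    the label chosen at the level above (hierarchical decoding)."""
    land_use = max(level1_scores, key=level1_scores.get)
    categories = HIERARCHY.get(land_use, {})
    if not categories:                      # no finer labels below this land use
        return (land_use, None, None)
    category = max(categories, key=lambda c: level2_scores.get(c, float("-inf")))
    crop = max(categories[category],
               key=lambda k: level3_scores.get(k, float("-inf")))
    return (land_use, category, crop)

pred = classify(
    {"arable land": 0.9, "grassland": 0.1},
    {"cereals": 0.7, "oilseeds": 0.2},
    {"wheat": 0.6, "barley": 0.3},
)
```

Constraining each level to the children of the previous choice is what distinguishes this from a flat 39-way classifier: a pixel can never be assigned a crop class from a category it was not routed through.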


Kate Wright

Data to Decision: uncertainties in environmental data science 

Uncertainties are an inherent feature of scientific research; these can be ‘aleatory’, arising from the random nature of the world, or ‘epistemic’, arising from limited knowledge or ignorance, and thus reducible with further research. Alongside these, language ambiguities can arise when researchers from different disciplinary backgrounds collaborate. Unquestionably, scientific uncertainties impact the actions taken by stakeholders, and researchers must also consider that the interpretation of, and response to, uncertainty differs between individuals. Understanding the many sources of uncertainty along the data-to-decision pipeline will aid the provision of robust scientific evidence to underpin decision-making. This evidence, accompanied by transparency about uncertainties, will enable decision-makers to understand the level of risk they are taking. Grounded in data collected from interviews and focus groups, this poster will discuss the uncertainties experienced by experts from environmental science, computer science, and statistics, to provide a new typology of uncertainty for environmental data science.

This work is being carried out as part of the Data Science of the Natural Environment (DSNE) Project at Lancaster University. 

Mala Virdee

Should global climate models be used to predict localised extreme risk?

General circulation models (GCMs) are not weather forecast models, and are not well-suited to the prediction of high-impact extreme weather events such as heatwaves. However, there is urgent demand from decision-makers for increasingly specific predictions of socio-economic risks associated with extreme weather under climate change. GCMs are therefore repurposed from their intended setting, as indicators of large-scale, long-term average climate trends, to a new class of tasks relating to prediction of localised extreme weather events on decision-relevant multi-decadal timescales.

Since extreme events simulated by GCMs cannot be interpreted as synchronous forecasts of particular weather events, they are aggregated to give a statistical prediction of the occurrence of future extremes. This approach relies on an assumption that the model credibly represents climate variability at the spatial and temporal scale in question. The extent to which this assumption is justified, and the spatial and temporal granularity at which this assumption fails, is not well-established. Repurposing GCMs in this way is facilitated by a wide range of statistical post-processing methods including bias correction, downscaling and multi-model ensembles. However, post-processing may over-extend the applicability of GCMs into a setting where they lack predictive skill, and produce misleadingly confident and precise results. It is critical that climate risk predictions provided to decision-makers are trustworthy.

In this work we aim to provide a statistical analysis of the adequacy of the CMIP6 GCM ensemble for prediction of climate extremes. For a historical reference period, a simulated daily time-series of temperature at a specific grid location is reordered to optimally match the equivalent observed daily series. Given that simulations are understood to be asynchronous, but assumed to provide valid statistics at some time-scale, each simulated daily temperature may be interchanged within an allowed window of days from its original date. After applying the optimal reordering algorithm, the RMSE of simulations of historical temperature extremes (the upper and lower 10% and 5% quantiles of observed temperatures) is calculated. Applying this method across a span of allowed reordering windows is used to give a measure of models’ predictive skill for extremes across time-scales.
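A simplified sketch of the window-constrained reordering idea: within each window, the simulated values are permuted to match the rank order of the observations (for a fixed window this rank matching is the least-squares-optimal permutation, so it serves as a stand-in for the optimal reordering algorithm), after which the RMSE can be restricted to extreme days. The data, window length, and quantile below are synthetic:

```python
import numpy as np

def reorder_within_windows(sim, obs, w):
    """Within each non-overlapping window of length w, permute the
    simulated values so their rank order matches the observations."""
    out = np.empty_like(sim)
    for start in range(0, len(sim), w):
        s = slice(start, start + w)
        ranks = np.argsort(np.argsort(obs[s]))   # rank of each observed day
        out[s] = np.sort(sim[s])[ranks]          # matched simulated values
    return out

def extreme_rmse(sim, obs, q=0.9):
    """RMSE restricted to days on which the observation exceeds its q-quantile."""
    hot = obs >= np.quantile(obs, q)
    return np.sqrt(np.mean((sim[hot] - obs[hot]) ** 2))

# Synthetic check: 'sim' has the right statistics but is out of phase with
# 'obs' by three days, mimicking an asynchronous simulation.
rng = np.random.default_rng(0)
t = np.arange(365)
obs = 10.0 + 8.0 * np.sin(2.0 * np.pi * t / 365.0) + rng.normal(0.0, 1.0, 365)
sim = np.roll(obs, 3)
reordered = reorder_within_windows(sim, obs, w=14)
rmse_raw = np.sqrt(np.mean((sim - obs) ** 2))
rmse_reordered = np.sqrt(np.mean((reordered - obs) ** 2))
```

Because reordering only permutes values within each window, the simulated statistics inside every window are preserved exactly, while the synchronised error can only decrease; sweeping the window length `w` then traces out predictive skill for extremes across time-scales, as described above.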

This approach provides an intuitive illustration of a continuity between the tasks for which GCMs are unsuited (synchronous forecasting of particular extreme weather events) and well-suited (long-term trend prediction). Given this continuity, there is some time-scale of aggregation at which simulated statistics of extremes may be considered credible for a specified variable, location and extreme quantile. Our method allows this scale to be found and compared across models. In this way, the trustworthiness of downstream extreme risk predictions can be evaluated. The work presented here can be extended to improve the GCM post-processing pipeline by development of multi-model ensembles that utilise this analysis for optimal model selection and combination.


Oreskes, N., Stainforth, D.A. and Smith, L.A., 2010. Adaptation to global warming: do climate models tell us what we need to know? Philosophy of Science, 77(5), pp.1012-1028.

Heymann, M. and Hundebøl, N.R., 2017. From heuristic to predictive: Making climate models into political instruments. Cultures of Prediction in Atmospheric and Climate Science, pp.100-119.

Hulme, M., Pielke, R. and Dessai, S., 2009. Keeping prediction in perspective. Nature Climate Change, 1(911), pp.126-127.

Meehl, G.A., Goddard, L., Murphy, J., Stouffer, R.J., Boer, G., Danabasoglu, G., Dixon, K., Giorgetta, M.A., Greene, A.M., Hawkins, E.D. and Hegerl, G., 2009. Decadal prediction: can it be skillful? Bulletin of the American Meteorological Society, 90(10), pp.1467-1486.

Hawkins, E. and Sutton, R., 2011. The potential to narrow uncertainty in projections of regional precipitation change. Climate dynamics, 37(1), pp.407-418.

Sillmann, J., Kharin, V.V., Zhang, X., Zwiers, F.W. and Bronaugh, D., 2013. Climate extremes indices in the CMIP5 multimodel ensemble: Part 1. Model evaluation in the present climate. Journal of Geophysical Research: Atmospheres, 118(4), pp.1716-1733.

Schwingshackl, C., Sillmann, J., Vicedo‐Cabrera, A.M., Sandstad, M. and Aunan, K., 2021. Heat stress indicators in CMIP6: estimating future trends and exceedances of impact‐relevant thresholds. Earth’s Future, 9(3), p.e2020EF001885.