This is an analytic walk-through of the following paper: ClimateBench v1.0: A Benchmark for Data-Driven Climate Projections https://doi.org/10.1029/2021MS002954 (Watson-Parris et al., 2022)


conda environment

The analysis was conducted under a conda virtual environment (esem). I'll attach the .yml file of the environment together with the assignment; you can install it with conda env create -f esem.yml. Or, if you prefer to install it manually, here are the commands I used:

conda create --name esem python scipy numpy pandas xarray matplotlib cartopy seaborn

pip install netcdf4 #this is better than using conda, less likely to run into .DLL file errors

pip install "dask[dataframe]"

pip install xskillscore

#you probably don't need the following if you don't want to train the emulators yourself

  • pip install eofs
  • pip install gpflow
  • pip install -U scikit-learn
  • pip install esem

The last command installs the model package used in the paper (https://github.com/duncanwp/ESEm/tree/v1.1.0); it will also automatically install its dependency tensorflow. Other packages such as keras and scikit-learn are also needed if you want to retrain the emulators from the paper, because their original code specifies hyperparameters using functions from these packages. For precisely reproducing the emulators and generating the data, it is better to use their settings.

dataset from paper for building emulators

data that I used to evaluate emulators

(The important part of the paper is evaluating the results generated from the emulators, NOT building the emulators! But the authors did not provide the emulators' results; instead, they provided the training and testing datasets listed above. So I downloaded the data, built the emulators myself, and generated the emulators' results using the template and the same packages as the paper…)

  • test dataset: https://zenodo.org/record/7064308/files/test.tar.gz?download=1 (the same data as in the paper, used for comparing the emulations with the truth generated from the models; a download helper sketch follows below)
  • outputs_ssp245_prediction_RF.nc; outputs_ssp245_predict_pr90.nc; outputs_ssp245_predict_pr.nc; outputs_ssp245_predict_tas.nc; outputs_ssp245_predict_dtr.nc (these are the emulator results I generated using the default template settings of the packages)
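If you want to fetch and unpack the test data directly from Python, here is a minimal helper sketch (my addition, not from the paper; the target directory mirrors the data_path used later in the analysis and can be changed to anything):

import os, tarfile, urllib.request

data_path = "C:\\Users\\maq\\scripts\\gg606\\data\\"  # change to wherever you keep the data
url = "https://zenodo.org/record/7064308/files/test.tar.gz?download=1"
archive = os.path.join(data_path, "test.tar.gz")

os.makedirs(data_path, exist_ok=True)
urllib.request.urlretrieve(url, archive)   # download the archive from Zenodo
with tarfile.open(archive) as tar:
    tar.extractall(data_path)              # unpack the test netCDF files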

ClimateBench v1.0: A Benchmark for Data-Driven Climate Projections

1. Introduction

  • There are a wide range of emission pathways consistent with climate goals;
  • Traditionally, scientists used Earth System Models (ESMs) to simulate those Shared Socioeconomic Pathways (SSPs), and explore what will happen under future scenarios;
  • Potential problem with the traditional approach: Earth System Models are expensive to run, and it's impractical to use such expensive models to fully explore the space of possibilities (O'Neill et al., 2016);
  • Solution proposed in the paper: ClimateBench, using a set of baseline machine/statistical learning models to emulate the response to a variety of forcers (potentially exploring a wider range of SSPs in a cheaper way).

Idea and assumptions behind this approach: state-of-the-art statistical/machine learning models have shown potential to capture nonlinear relationships in long-term climate responses (Mansfield et al., 2020), and these methods are far less computationally costly than the ESMs traditionally used to simulate climate projections. In the paper, such statistical models are employed as "emulators", trained on CMIP6 outputs. These emulators can then be used as a cheaper but robust alternative to explore a wide range of emission pathways, predicting annual-mean global distributions of temperature and precipitation under scenarios that have not been simulated.

The analytic goals of the paper are:

  • Providing evidence that statistical emulators are robust in capturing nonlinear relationships in the Earth system (i.e., an emulator can reproduce the target outputs traditionally generated through ESMs)
  • Comparing the different statistical/machine learning methods used to build the emulators (i.e., benchmarking their performance in emulating climate projections)
  • Evaluating the emulators by assessing whether the emulation results fulfill physical constraints (i.e., the emulated outputs are meaningful and likely not the result of over-fitting)

2. Data and emulator

2.1 Data

Input variables: global mean emissions of carbon dioxide ($CO_{2}$) and methane ($CH_4$), and the first 5 principal components of sulfur dioxide ($SO_2$) and black carbon ($BC$) emissions

Output variables: temperature (surface air temperature); precipitation (total precipitation); diurnal temperature range (difference between daily maximum and minimum surface air temperature); extreme precipitation (90th percentile of the daily precipitation)
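As an aside, the extreme-precipitation target is just an annual quantile of daily precipitation. A minimal sketch of how such a pr90 field could be derived from daily model output (the file name here is hypothetical; ClimateBench ships the variable pre-computed):

import xarray as xr

# hypothetical daily precipitation file; ClimateBench provides pr90 pre-computed
pr_daily = xr.open_dataset("pr_day_NorESM2-LM_ssp245.nc")["pr"]
# 90th percentile of daily precipitation within each year, per grid cell
pr90 = pr_daily.groupby("time.year").quantile(0.9, dim="time")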

Task for the emulators: predict the output variables using only the input variables under the held-out test scenario ssp245. The emulators are evaluated based on their prediction skill over 2080-2100.

Training dataset for the emulators: the training data come from the following experiments: the historical simulation; ssp126; ssp370; ssp585; and the aerosol-only (hist-aer) and greenhouse-gas-only (hist-GHG) historical simulations. These experiments are described in the table below:

| Protocol | Experiment | Period | Notes |
|---|---|---|---|
| ScenarioMIP | ssp126 | 2015-2100 | A high-ambition scenario designed to produce significantly less than 2° warming by 2100 |
| ScenarioMIP | ssp245 | 2015-2100 | Designed to represent a medium forcing future scenario. This is the test scenario to be held back for evaluation |
| ScenarioMIP | ssp370 | 2015-2100 | A medium-high forcing scenario with high emissions of near-term climate forcers (NTCF) such as methane and aerosol |
| ScenarioMIP | ssp585 | 2015-2100 | This scenario represents the high end of the range of future pathways in the IAM literature and leads to a very large forcing of 8.5 Wm−2 in 2100 |
| CMIP6 | historical | 1850-2014 | A simulation using historical emissions of all forcing agents designed to recreate the historically observed climate |
| DAMIP | hist-GHG | 1850-2014 | A historical simulation with varying concentrations of CO2 and other long-lived greenhouse gases (only) |
| DAMIP | hist-aer | 1850-2014 | A historical simulation only forced by changes in anthropogenic aerosol |

In the training process, the input and output variables from each of the selected scenarios were used as the training dataset, and the hyperparameters of each emulator were optimized through the optimisation algorithms available in the ESEm package.
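To make the input structure concrete, here is a hypothetical sketch of how the gridded aerosol emissions could be reduced to their leading principal components and stacked with the global CO2 and CH4 series (the file and variable names are assumptions for illustration; the paper does this preprocessing through the ESEm/ClimateBench utilities, and in a real workflow the PCA would be fitted on the training scenarios only and then applied to the test scenario):

import numpy as np
import xarray as xr
from sklearn.decomposition import PCA

inp = xr.open_dataset("inputs_ssp126.nc")            # assumed file and variable names
so2 = inp["SO2"].values.reshape(inp.time.size, -1)   # (years, lat*lon)
bc = inp["BC"].values.reshape(inp.time.size, -1)

# keep the first 5 principal components of each gridded emission field
so2_pcs = PCA(n_components=5).fit_transform(so2)
bc_pcs = PCA(n_components=5).fit_transform(bc)

# 12 predictors per year: CO2, CH4, 5 SO2 PCs, 5 BC PCs
X = np.column_stack([inp["CO2"].values, inp["CH4"].values, so2_pcs, bc_pcs])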

2.2 Method in building emulators

2.2.1 Random Forest

A tree-based method repeatedly splits the data into subsets according to its features such that the within-subset variance is low and the between-subset variance is high. Different trees are trained on different subsets of the data, or with some of the feature dimensions held back for each individual tree. The forest then makes a prediction by averaging over the predictions of all individual trees.

Training Setting:

  • training data: global mean emissions of $CO_{2}$ and $CH_4$, first 5 principal components of $SO_2$ and $BC$
  • The following hyperparameters are tuned using a random search, sampling candidate configurations without replacement: the number of trees, the tree depth, and the minimum number of samples required to split a node and to be at each leaf node (a minimal search sketch follows below).
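A minimal sketch of such a random search using scikit-learn directly (the paper drives this through the ESEm wrapper; the value ranges below are my own placeholders, not the authors' search space):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Placeholder ranges (my assumptions, not the authors' exact search space).
# With list-valued grids, RandomizedSearchCV samples candidate settings without replacement.
param_distributions = {
    "n_estimators": [50, 100, 150, 200, 250],
    "max_depth": [5, 10, 20, 40, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4, 8],
}

search = RandomizedSearchCV(RandomForestRegressor(), param_distributions, n_iter=30, cv=3)
# X: (n_years, 12) inputs as described above; y: (n_years, n_lat * n_lon) flattened target maps
# search.fit(X, y); best_rf = search.best_estimator_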
2.2.2 Neural Network

The chosen architecture consists of a CNN followed by an LSTM, built with the Keras library. The CNN includes one convolutional layer with 20 filters, a filter size of 3, and a ReLU activation function. The 3 × 3 pixel filters scan the input maps to detect spatial patterns and feed these patterns to the next layers, which are average-pooling layers that reduce the spatial dimensionality ahead of the LSTM layer.

Training Setting:

  • The training-data time series is segmented into 10-year chunks using a moving time window in one-year increments, leading to 754 training samples of shape (10, 96, 144, 4), corresponding to the number of years, latitudes, longitudes, and input variables.
  • Four different emulators were trained, one for each of the four output variables.
  • No hyperparameter optimization was performed; all parameters were chosen manually (a minimal architecture sketch follows below).
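For concreteness, here is a minimal Keras sketch of a CNN-LSTM of the kind described above. The convolution settings follow the text; the LSTM width, pooling size and output head are my assumptions, so treat this as illustrative rather than the authors' exact model:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(10, 96, 144, 4)),                        # 10 years x lat x lon x 4 inputs
    layers.TimeDistributed(layers.Conv2D(20, 3, padding="same", activation="relu")),
    layers.TimeDistributed(layers.AveragePooling2D(2)),          # reduce spatial dimensionality
    layers.TimeDistributed(layers.GlobalAveragePooling2D()),     # one feature vector per year
    layers.LSTM(25),                                             # assumed width
    layers.Dense(96 * 144, activation="linear"),
    layers.Reshape((96, 144)),                                   # one output map per sample
])
model.compile(optimizer="rmsprop", loss="mse")
# model.fit(X, y, ...) with X of shape (754, 10, 96, 144, 4) and y of shape (754, 96, 144)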

3. Walk-through analysis of 2 emulators

Here I reproduce the evaluation process for 2 of the baseline emulators described in the paper (and above).

The methods used in the paper to evaluate the emulators are:

  1. t-test
  2. NRMSE
  3. comparison with physically constrained relationships

which are demonstrated below:

import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import xarray as xr
from glob import glob
import seaborn as sns
import numpy as np
import pandas as pd
# read the model data, and emulation data I generated
data_path = "C:\\Users\\maq\\scripts\\gg606\\data\\"
# In order to rerun the script, change the above to wherever the data is stored

variables = ['tas', 'dtr', 'pr', 'pr90']
output_ssp245 = xr.open_dataset(data_path  + 'outputs_ssp245.nc').rename(diurnal_temperature_range="dtr").mean('member')
rf_predictions = xr.open_dataset(data_path  + 'outputs_ssp245_prediction_RF.nc').swap_dims(sample='time')
rf_predictions = rf_predictions.assign_coords(time=rf_predictions.time+1).rename(diurnal_temperature_range="dtr")
nn_predictions = xr.merge([{v: xr.open_dataarray(data_path  + "outputs_ssp245_predict_{}.nc".format(v))} for v in variables])
# Convert the precipitation values from kg m-2 s-1 to mm/day (the RF predictions are already in mm/day)
output_ssp245["pr"] *= 86400
output_ssp245["pr90"] *= 86400
nn_predictions["pr"] *= 86400
nn_predictions["pr90"] *= 86400
# Now we have 3 datasets: one is the original truth, the other 2 are predictions from the emulators
models = [output_ssp245, rf_predictions, nn_predictions]
model_names = ['output_ssp245','Random Forest', 'Neural Network']
var_names = labels = ["Temperature (K)", "Diurnal temperature range (K)", "Precipitation (mm/day)", "Extreme precipitation (mm/day)"]

# t-test (two-tailed) on the 2080-2100 differences between emulation and truth
def ttest(diff_mean, diff_std, diff_num):
    from scipy.stats import distributions
    z = diff_mean / np.sqrt(diff_std ** 2 / diff_num)
    # use np.abs to get upper tail, 
    # then multiply by two as this is a two-tailed test
    p = distributions.t.sf(np.abs(z), diff_num - 1) * 2
    return z, p

p_level = 0.05

proj = ccrs.Robinson() # this proj is sooooo slow...
kwargs = [dict(cmap="coolwarm", vmin=-1, vmax=1), dict(cmap="coolwarm", vmin=-0.5, vmax=0.5), dict(cmap="coolwarm", vmin=-1, vmax=1), dict(cmap="coolwarm", vmin=-2, vmax=2)]
with sns.plotting_context("talk"):
    fig, axes = plt.subplots(4, 2, subplot_kw=dict(projection=proj), figsize=(16, 16), constrained_layout=True)
    #print(axes)
    for fig_axes, var, var_name, kws in zip(axes, variables, var_names, kwargs):
        for ax, model, model_name in zip(fig_axes, models[1:], model_names[1:]):
            ax.set_title(model_name)
            #print(model_name, model)
            diff = (model[var]-models[0][var]).sel(time=slice(2080, 2100)) 
            mean_diff = diff.mean('time')
            #print(mean_diff)
            _, p = ttest(mean_diff, diff.std('time'), diff.count('time'))
            if model_name == 'Neural Network':
                mean_diff.where(p < p_level).plot(ax=ax, add_labels=False, transform=ccrs.PlateCarree(), cbar_kwargs={"label":var_name, "orientation":'vertical'}, **kws)
            else:
                mean_diff.where(p < p_level).plot(ax=ax, add_labels=False, transform=ccrs.PlateCarree(), add_colorbar=False, **kws)
            ax.coastlines()

The t-test produces maps of the mean difference in the ClimateBench target output variables between the 2 baseline emulators and the target values under the test ssp245 scenario, averaged over 2080-2100. Differences that are insignificant at the p < 5% level are masked from the plots. The figure shows a similar bias structure for the 2 selected emulators: for example, both tend to overpredict temperature in the northern hemisphere and underpredict it over the southern-hemisphere oceans. The colorbar indicates the actual difference between emulation and simulation. Overall, the Random Forest emulator tends to produce larger differences in the target variables, while the neural network performs best at predicting them.
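As a quick sanity check on the hand-rolled ttest() above (my addition, not part of the paper), it should agree with scipy.stats.ttest_1samp against a zero mean, provided the standard deviation is passed with ddof=1 (note that xarray's .std() defaults to ddof=0, so the significance mask above is slightly anti-conservative):

from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
sample = rng.normal(0.3, 1.0, size=21)            # e.g. 21 years of grid-point differences

z, p = ttest(sample.mean(), sample.std(ddof=1), sample.size)
t_ref, p_ref = ttest_1samp(sample, popmean=0.0)
print(np.isclose(z, t_ref), np.isclose(p, p_ref))  # expect: True True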

# rmse
from xskillscore import rmse
# area weights ~ cos(latitude), broadcast over longitude for xskillscore's weighted RMSE
weights = np.cos(np.deg2rad(output_ssp245['tas'].lat)).expand_dims(lon=144).assign_coords(lon=output_ssp245.lon)
def global_mean(ds):
    weights = np.cos(np.deg2rad(ds.lat))
    return ds.weighted(weights).mean(['lat', 'lon'])

RMSE_S = pd.DataFrame({
    name: {var: rmse(output_ssp245[var].sel(time=slice(2080, None)).mean('time'), 
                               model[var].sel(time=slice(2080, None)).mean('time'), weights=weights).data/ np.abs(global_mean(output_ssp245[var].sel(time=slice(2080, None)).mean('time')).data) for var in variables} 
    for name, model in zip(model_names[1:], models[1:])
})

RMSE_G = pd.DataFrame({
    name: {var: rmse(global_mean(output_ssp245[var].sel(time=slice(2080, None))), 
                                global_mean(model[var].sel(time=slice(2080, None)))).data/ np.abs(global_mean(output_ssp245[var].sel(time=slice(2080, None)).mean('time')).data) for var in variables} 
    for name, model in zip(model_names[1:], models[1:])
})

RMSE_T = RMSE_S + 5*RMSE_G

merge_rmse = pd.concat([RMSE_S, RMSE_G, RMSE_T], keys=['Spatial', 'Global', 'Total'])[model_names[1:]].T.swaplevel(axis=1)[variables]
merge_rmse.style.highlight_min(axis = 0, props='font-weight: bold').format("{:.3f}")
| | tas Spatial | tas Global | tas Total | dtr Spatial | dtr Global | dtr Total | pr Spatial | pr Global | pr Total | pr90 Spatial | pr90 Global | pr90 Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random Forest | 0.108 | 0.058 | 0.400 | 9.195 | 2.652 | 22.457 | 2.524 | 0.502 | 5.035 | 2.682 | 0.543 | 5.399 |
| Neural Network | 0.102 | 0.043 | 0.316 | 8.493 | 1.679 | 16.885 | 2.191 | 0.209 | 3.235 | 2.765 | 0.318 | 4.356 |

The primary metric used to evaluate the emulators is the normalized, global-mean root-mean-square error (NRMSE). It quantifies the deviation between the emulated output variables and the target output variables in the test scenario; the smaller the NRMSE, the smaller the error produced by the emulator.

In the paper, they constructed spatial, global and total NRMSE in the following forms:

Spatial NRMSE: $$NRMSE_s = \sqrt{\left< \left( |x_{i,j,t}|_t - |y_{i,j,t,n}|_{n,t} \right)^2 \right>} \, / \, \left| \left< y_{i,j} \right> \right|_{t,n}$$

Global NRMSE: $$NRMSE_g = \sqrt{\left| \left( \left< x_{i,j,t} \right> - \left< |y_{i,j,t,n}|_{n} \right> \right)^2 \right|_{t}} \, / \, \left| \left< y_{i,j} \right> \right|_{t,n}$$

Total NRMSE: $$NRMSE_t = NRMSE_s + \alpha * NRMSE_g$$

where $<x_{i,j}>$ denotes the global mean with a weighting function that accounts for the decreasing grid-cell area toward the poles, $|x|_{t}$ denotes an average over time, and $|x|_{n}$ an average over ensemble members:

$$<x_{i,j}> = \frac{1}{N_{lat}N_{lon}} \sum_{i}^{N_{lat}} \sum_{j}^{N_{lon}} \cos(lat(i))\, x_{i,j}$$

$\alpha$ is a defined weighting coefficient, and $\alpha = 5$ in this paper.
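To make the formulas concrete, here is a minimal re-implementation of the three NRMSE terms directly from the definitions above, using the xarray objects loaded earlier (my sketch; the table itself was produced with xskillscore, and the ensemble dimension is already averaged out in these files):

def nrmse_components(truth, pred, alpha=5):
    """Spatial, global and total NRMSE over 2080-2100, following the definitions above."""
    truth = truth.sel(time=slice(2080, 2100))
    pred = pred.sel(time=slice(2080, 2100))
    w = np.cos(np.deg2rad(truth.lat))

    def gmean(da):                                   # <.> : area-weighted global mean
        return da.weighted(w).mean(['lat', 'lon'])

    denom = np.abs(gmean(truth).mean('time'))        # |<y>|_t
    nrmse_s = float(np.sqrt(gmean((pred.mean('time') - truth.mean('time')) ** 2)) / denom)
    nrmse_g = float(np.sqrt(((gmean(pred) - gmean(truth)) ** 2).mean('time')) / denom)
    return nrmse_s, nrmse_g, nrmse_s + alpha * nrmse_g

# e.g. nrmse_components(output_ssp245['tas'], nn_predictions['tas'])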

As shown in the table above, the neural network emulator performs better at predicting temperature and precipitation changes: for most target variables and most metrics it scores a smaller NRMSE, i.e. a smaller difference between its emulations and the original model simulations.

# Clausius-Clapeyron relationship and energy conservation considerations
baseline_precip = 2.335  # global-mean baseline precipitation (mm/day) used to express changes in %
with sns.plotting_context("talk"):
    x = np.linspace(0, 2.5, 100)
    fig, ax = plt.subplots(1, 1, figsize=(10, 8))
    for model, model_name in zip(models, model_names):
        smooth = global_mean(model).coarsen(time=5, boundary='pad').mean().dropna('time')
        x_, y_ = smooth['tas'], smooth['pr']/baseline_precip*100
        s=ax.scatter(x_.sel(time=slice(2030, None)), y_.sel(time=slice(2030, None)), label=model_name)   
    plt.plot(x, x*6, label="Clausius-Clapeyron (6%/K)", ls='-', c='k')
    plt.plot(x, x*2-1, label="Energetic constraint (2%/K)", ls='--', c='r')
    plt.setp(plt.gca(), xlabel="Temperature change (K)", ylabel="Precipitation change (%)", xlim=[0, 2.5], ylim=[0, 4])
    plt.legend(loc='upper left')

One common concern about applying statistical emulators to Earth system modelling is their ability to capture and reproduce physical relationships, i.e. whether the emulators are simply overfitted to the existing dataset. In the paper, the trustworthiness of the emulators is assessed by checking whether the emulations capture the same relationships imposed by physical constraints. In particular, the Clausius-Clapeyron relationship is used, framed here as the relative change in global mean precipitation as a function of global mean temperature change. In simple atmospheric physics terms, the Clausius-Clapeyron relationship states that warmer air can hold more water vapour: the saturation water-vapour content of the atmosphere increases by roughly 6-7% per 1 K of warming, which would set the scale of precipitation changes if we only considered the moisture available in a local air mass.
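For reference, the ~6-7% per K figure follows directly from the Clausius-Clapeyron equation for the saturation vapour pressure $e_s$, using typical values $L_v \approx 2.5 \times 10^6\,\mathrm{J\,kg^{-1}}$, $R_v \approx 461\,\mathrm{J\,kg^{-1}\,K^{-1}}$ and $T \approx 288\,\mathrm{K}$:

$$\frac{1}{e_s}\frac{de_s}{dT} = \frac{L_v}{R_v T^2} \approx \frac{2.5\times10^{6}}{461 \times 288^2}\ \mathrm{K^{-1}} \approx 0.065\ \mathrm{K^{-1}} \approx 6.5\%\ \mathrm{per\ K}$$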

However, in the real Earth system we also need to consider energetic constraints when assessing global precipitation changes. Global-mean precipitation changes are balanced by the atmospheric radiative cooling (Pendergrass & Hartmann, 2013), so the relative change of global precipitation with temperature is only around 2% per 1 K.

The figure above compares the predicted temperature changes against the relative global precipitation changes in the emulations and the simulations. The slope of the red dashed line indicates the energetically constrained relationship between temperature and precipitation. If the slope of the scattered points is close to the slope of the red line, the changes predicted by the emulations are close to this physical relationship. The figure indicates that the neural network emulator performs better at capturing the relationship between changes in global precipitation and temperature: its predicted relationship is close to both the constrained value and the model simulation results under the target ssp245 scenario.
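To put a number on the scatter above, one can also fit a straight line to the same post-2030 points and compare the slopes directly (my addition, not part of the paper's evaluation):

# least-squares slope of relative precipitation change (%) vs temperature change (K)
for model, model_name in zip(models, model_names):
    smooth = global_mean(model).coarsen(time=5, boundary='pad').mean().dropna('time')
    t_anom = smooth['tas'].sel(time=slice(2030, None)).values
    dp_pct = (smooth['pr'].sel(time=slice(2030, None)) / baseline_precip * 100).values
    slope, intercept = np.polyfit(t_anom, dp_pct, 1)
    print(f"{model_name}: {slope:.2f} %/K")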

4. Critique

4.1 Data

The data used in the paper to construct the emulators come from validated, peer-reviewed Earth system modelling projects performing climate projections under given future scenarios. These model simulation data are easily accessible and of very good quality.

However, the trained emulators and the emulation results used in the paper for analysis are not provided. In order to reproduce the paper, I had to use the ML algorithms in the given packages and the authors' hyperparameters, with the same model simulation data, to train the emulators myself. This creates a barrier to reproducing the paper. It is still possible to reproduce it, though, and the results from the paper are validated through my reproduction.

4.2 method

4.2.1 Emulator

As the goal of this paper is to perform a benchmarking exercise and assess different methods for constructing emulators, the selection of the specific emulation methods is important. In the paper, the assessed emulators are all constructed using popular and powerful machine learning methods that have previously been applied to emulating nonlinear relationships in the Earth system, which makes the emulators well suited to the task.

The process of constructing the emulators also follows established approaches in the field, where the relationships between emitted greenhouse gases and temperature/precipitation changes are used to construct the emulators. The hyperparameters selected for each emulator are reasonable and listed in the appendix, and the packages used are well constructed and accessible from GitHub, which makes reproducing this paper easier.

In terms of the emulation itself, the emulators are applied to an interpolation problem, where a broader range of SSPs is used for training and only one target SSP (ssp245) within that range is used for testing. Using statistical/machine learning models for interpolation is a safer approach than using them for extrapolation, and can help avoid overfitting. However, the selected SSPs span a fairly small range, and the data used for training are not fully independent. This partly stems from the limited number of model simulations available to explore SSPs, and partly from the time-dependent nature of the Earth system itself, where a variable at one timestep depends on its previous timesteps. Therefore, although the authors used several SSP simulations and their data at different timesteps for training, the training data are not entirely independent, which may affect the emulators' results.

4.2.2 Evaluation

The methods used for the evaluation are trustworthy and useful.

The authors employed Student's t-test to assess whether there are significant differences between emulations and simulations. By showing maps of the differences, readers can clearly see the spatial bias structure of each emulator, which helps illustrate the emulators' biases. Regarding the specific explanation of the t-test (and its normal-distribution approximation), the authors did not include a detailed explanation but pointed to a statistics textbook. This is probably because the t-test is widely used in the atmospheric sciences and it is more efficient to assume readers understand this common method.

The RMSE metrics are a clear and straightforward way to quantify the errors produced by the emulators, providing a numeric value for evaluation that complements the spatial maps of t-tested differences. However, the authors did not provide a detailed explanation of each of the RMSE variants they constructed, nor why they favour one over another. The spatial and global NRMSE used in the paper differ only in when the average over the time dimension is performed (before or after calculating the difference), i.e. whether the averaging neglects time variability within the selected timeframe. As the task for the emulators in this paper is to predict the target temperature and precipitation changes over a given period, time variability in the atmosphere is not the first priority here. Nevertheless, an explanation of the different RMSE metrics used in the research would still have been useful.

Lastly, assessing the relative relationship between global precipitation and temperature changes is a good way to highlight the emulators' ability to capture physical relationships under energy constraints. Given that the target outputs in the research are either precipitation or temperature, the selected relationship is a useful check, and demonstrating it through a scatterplot fully illustrates the relationship between the two variables.

4.3 Structure and layouts

The style of the figures in the paper is clear and appealing, and fulfills the common requirements of scientific publication. Apart from benchmarking different emulators, the paper also serves as an introduction for readers who are not specialized in emulation and calibration. Therefore, in the first half of the paper the authors take considerable effort to introduce the background of the research, which dataset is used and which emulators are employed, and also illustrate the characteristics of the dataset with several figures. Those figures are not directly used for the analysis but for introducing the background and dataset, which makes for a clearer introduction but also pushes the first analytic figure back to Figure 4 of the paper. In this analytic demo I therefore selected only the figures useful for benchmarking and evaluating the emulators, in a concise way. Overall, the original paper is informative and provides a nice demonstration of benchmarking and evaluating the selected emulators for climate projection.

Reference

  • Mansfield, L. A., Nowack, P. J., Kasoar, M., Everitt, R. G., Collins, W. J., & Voulgarakis, A. (2020). Predicting global patterns of long-term climate change from short-term simulations using machine learning. NPJ Climate and Atmospheric Science, 3(1), 44. https://doi.org/10.1038/s41612-020-00148-5
  • O'Neill, B. C., Tebaldi, C., van Vuuren, D. P., Eyring, V., Friedlingstein, P., Hurtt, G., et al. (2016). The scenario model intercomparison project (ScenarioMIP) for CMIP6. Geoscientific Model Development, 9, 3461–3482. https://doi.org/10.5194/gmd-9-3461-2016
  • Pendergrass, A. G., & Hartmann, D. L. (2013). The atmospheric energy constraint on global-mean precipitation change. Journal of Climate, 27(2). https://doi.org/10.1175/jcli-d-13-00163.1
  • Watson-Parris, D., Williams, A., Deaconu, L., & Stier, P. (2021). Model calibration using ESEm v1.1.0 – an open, scalable Earth system emulator. Geoscientific Model Development, 14(12), 7659–7672. https://doi.org/10.5194/gmd-14-7659-2021
  • Watson-Parris, D., Rao, Y., Olivié, D., Seland, Ø., Nowack, P., Camps-Valls, G., Stier, P., Bouabid, S., Dewey, M., Fons, E., Gonzalez, J., Harder, P., Jeggle, K., Lenhardt, J., Manshausen, P., Novitasari, M., Ricard, L., & Roesch, C. (2022). ClimateBench v1.0: A Benchmark for Data-Driven Climate Projections. Journal of Advances in Modeling Earth Systems, 14(10), e2021MS002954. https://doi.org/10.1029/2021MS002954