Surrogate Models — Gaussian Process
The Gaussian process (GP) model, also known as the Kriging model, is a popular surrogate model. It gives an approximate, probabilistic prediction of the QoIs and computes the variance of that prediction at every point in the input parameter space. A GP is specified by its mean function, also known as the trend, and a covariance function, also called a kernel or correlation function. The hyperparameters of the mean and covariance functions are optimized while training the GP model. Consider input parameters x and a scalar QoI y. The GP model can then be expressed as
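(here written in the standard form of Ref. 7)
\[
y(\mathbf{x}) = m(\mathbf{x}) + f(\mathbf{x}), \qquad f(\mathbf{x}) \sim \mathcal{GP}\bigl(0, k_{\theta}(\mathbf{x}, \mathbf{x}')\bigr),
\]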
where m(x) is the mean function and f(x) is a GP with zero mean and covariance kθ(x, x'). Here, θ represents the hyperparameters of the covariance function.
It is then possible to write the joint distribution of the evaluated data and the predicted data as
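(in the standard zero-mean prior form, see Ref. 7)
\[
\begin{bmatrix} \mathbf{f} \\ \mathbf{f}' \end{bmatrix}
\sim \mathcal{N}\!\left(\mathbf{0},\;
\begin{bmatrix}
k_{\theta}(\mathbf{x}, \mathbf{x}) & k_{\theta}(\mathbf{x}, \mathbf{x}') \\
k_{\theta}(\mathbf{x}', \mathbf{x}) & k_{\theta}(\mathbf{x}', \mathbf{x}')
\end{bmatrix}\right),
\]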
where f = y − m(x), y is the evaluated QoI, and f' and x' are the predicted QoI and its corresponding input parameters.
The posterior distribution of the GP is written as
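(following Ref. 7)
\[
f' \mid \mathbf{x}, \mathbf{y}, \mathbf{x}' \sim \mathcal{N}\bigl(\mu(\mathbf{x}'), \operatorname{cov}(\mathbf{x}', \mathbf{x}')\bigr),
\]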
where μ(x') and cov(x', x') are the mean and covariance of the model, respectively.
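In the standard form of Ref. 7, these are
\[
\mu(\mathbf{x}') = m(\mathbf{x}') + k_{\theta}(\mathbf{x}', \mathbf{x})\, k_{\theta}(\mathbf{x}, \mathbf{x})^{-1} \bigl(\mathbf{y} - m(\mathbf{x})\bigr),
\]
\[
\operatorname{cov}(\mathbf{x}', \mathbf{x}') = k_{\theta}(\mathbf{x}', \mathbf{x}') - k_{\theta}(\mathbf{x}', \mathbf{x})\, k_{\theta}(\mathbf{x}, \mathbf{x})^{-1}\, k_{\theta}(\mathbf{x}, \mathbf{x}').
\]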
Covariance functions are a crucial ingredient of GPs: they determine the shape of the prior and posterior of the model. They encode assumptions about the function being learned by defining the similarity of two input parameter points, either through their separation only (stationary kernel) or through the specific values of the input parameter points (nonstationary kernel). The module provides three common stationary kernels and one nonstationary kernel. The stationary kernels are the squared exponential kernel
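(written here with the scaled distance r defined below)
\[
k_{\theta}(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\frac{r^2}{2}\right),
\]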
and the Matérn class kernel with ν = 3/2 and ν = 5/2
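(standard forms, see Ref. 7)
\[
k_{\theta}(\mathbf{x}, \mathbf{x}') = \bigl(1 + \sqrt{3}\,r\bigr)\exp\!\bigl(-\sqrt{3}\,r\bigr) \quad (\nu = 3/2),
\]
\[
k_{\theta}(\mathbf{x}, \mathbf{x}') = \Bigl(1 + \sqrt{5}\,r + \tfrac{5}{3}\,r^2\Bigr)\exp\!\bigl(-\sqrt{5}\,r\bigr) \quad (\nu = 5/2),
\]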
where r is a function of x and x' defined as
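(writing ℓi for the length scale of the i-th input parameter, consistent with the ARD approach discussed below)
\[
r = \sqrt{\sum_{i=1}^{m} \frac{(x_i - x_i')^2}{\ell_i^2}}.
\]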
The nonstationary kernel is the single-layer neural network kernel
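(in the form given in Ref. 7, writing x̃ for the augmented input vector)
\[
k_{\theta}(\mathbf{x}, \mathbf{x}') = \frac{2}{\pi}\,\sin^{-1}\!\left(\frac{2\,\tilde{\mathbf{x}}^{\mathsf{T}}\Sigma\,\tilde{\mathbf{x}}'}{\sqrt{\bigl(1 + 2\,\tilde{\mathbf{x}}^{\mathsf{T}}\Sigma\,\tilde{\mathbf{x}}\bigr)\bigl(1 + 2\,\tilde{\mathbf{x}}'^{\mathsf{T}}\Sigma\,\tilde{\mathbf{x}}'\bigr)}}\right).
\]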
Here, x̃ is the augmented input vector x̃ = (1, x1, …, xm)ᵀ and Σ = diag(σ0², σ²), where σ0 and σ = (σ1, …, σm) are the hyperparameters. Further details about the covariance functions and the Gaussian process can be found in Ref. 7.
The covariance functions are assumed to be anisotropic. The automatic relevance determination (ARD) method is used to define different length scales for each input parameter. The ARD method is given its name because estimating the length scale parameters implicitly determines the “relevance” of each dimension. Input dimensions with relatively large length scales imply relatively little variation along those dimensions in the QoI function being modeled.
The squared exponential kernel is the smoothest: a GP model with a squared exponential kernel is infinitely differentiable. The Matérn kernel with ν = 5/2 is less smooth, and the Matérn kernel with ν = 3/2 is rougher still; a GP model with a Matérn kernel is differentiable only a finite number of times. Comparing a Matérn kernel and a squared exponential kernel with the same length scale, the Matérn kernel has weaker correlation and produces more oscillatory samples. Samples from a GP with the single-layer neural network kernel can be viewed as superpositions of the outputs of a single-layer neural network that uses erf as its activation function. The GP model created with the single-layer neural network kernel is smooth, and in regions where the absolute value of x is large it approaches a constant value.
Regardless of which covariance function is used, the resulting covariance matrix typically needs to be inverted while training the GP model. This inversion is known to suffer from numerical instabilities. To circumvent this limitation, a nugget factor defined as σ²I is added to the covariance matrix. Here, σ is learned as an additional hyperparameter during the training of the GP.
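That is, the covariance matrix of the sample points used during training becomes
\[
K = k_{\theta}(\mathbf{x}, \mathbf{x}) + \sigma^2 I.
\]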
There are three types of mean functions for GPs: constant, linear, and quadratic.
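In one common parameterization (the exact form of the quadratic term may differ), these are
\[
m(\mathbf{x}) = \beta_0, \qquad
m(\mathbf{x}) = \beta_0 + \sum_{i}\beta_i x_i, \qquad
m(\mathbf{x}) = \beta_0 + \sum_{i}\beta_i x_i + \sum_{i}\sum_{j \ge i}\beta_{ij} x_i x_j.
\]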
GP models with constant and linear mean functions are also known as ordinary Kriging and universal Kriging, respectively. The coefficients for the mean functions are also learned during the training of the GP.
If a priori information about the mean function is available, it can be used to increase the modeling accuracy. For example, when a GP is used for sensitivity analysis or uncertainty propagation, the GP may have to be extrapolated far from the sample points. If a priori information about the mean is available, using a linear or quadratic mean function can be beneficial for such extrapolation.
While training the GP, the hyperparameters of the covariance function and the mean function, as well as the nugget factor, are learned by maximizing the log marginal likelihood of the GP:
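(the standard log marginal likelihood, see Ref. 7)
\[
\log p(\mathbf{y} \mid \mathbf{x}, \theta) = -\tfrac{1}{2}\bigl(\mathbf{y} - m(\mathbf{x})\bigr)^{\mathsf{T}} K^{-1}\bigl(\mathbf{y} - m(\mathbf{x})\bigr) - \tfrac{1}{2}\log\lvert K\rvert - \tfrac{n}{2}\log 2\pi,
\]
where K = kθ(x, x) + σ²I is the covariance matrix of the sample points including the nugget term and n is the number of sample points.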
Since the log-likelihood may have multiple local optima, the local optimizer used for training the hyperparameters needs to be restarted from different starting points in the hyperparameter space. The surrogate model automatically provides a number of restart points for training the hyperparameters. These points are chosen to lie in regions of the hyperparameter space where the negative log-likelihood is relatively small, and they are also space filling in the hyperparameter space. They are obtained by first generating a large number of candidate points and then selecting distinct starting points with a K-means clustering method, following the approach described in Ref. 8 (Journal of Statistical Software).
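As an illustration only (not the module's implementation), the same ingredients, an anisotropic ARD Matérn kernel, a learned nugget, and restarted maximization of the log marginal likelihood, can be sketched with scikit-learn on toy data. The class, kernel, and parameter names below are scikit-learn's, and its restart points are drawn randomly from the hyperparameter bounds rather than with the clustering strategy described above.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

# Toy data: two input parameters and one scalar QoI.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))
y = np.sin(3.0 * X[:, 0]) + 0.5 * X[:, 1] ** 2

# Anisotropic (ARD) Matern kernel with nu = 5/2: one length scale per input
# dimension. The WhiteKernel term plays the role of the nugget sigma^2 * I.
kernel = (
    ConstantKernel(1.0, (1e-3, 1e3))
    * Matern(length_scale=[1.0, 1.0], length_scale_bounds=(1e-2, 1e2), nu=2.5)
    + WhiteKernel(noise_level=1e-6, noise_level_bounds=(1e-12, 1e-1))
)

# n_restarts_optimizer restarts the local optimizer of the log marginal
# likelihood from additional starting points in hyperparameter space.
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10,
                              normalize_y=True)
gp.fit(X, y)

# Posterior mean and standard deviation at new input parameter points.
X_new = rng.uniform(-1.0, 1.0, size=(5, 2))
mean, std = gp.predict(X_new, return_std=True)
print(gp.kernel_)  # optimized hyperparameters, including the nugget level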
For adaptive GPs, the GP is constructed adaptively by adding a new input parameter point at the location of the maximum entropy, when it is used for sensitivity analysis or uncertainty propagation, or at the location of the maximum expected feasibility function, when it is used for reliability analysis. For sensitivity analysis or uncertainty propagation, the GP is trained to approximate the underlying model accurately in the entire input parameter space. It follows the heuristics in Ref. 9, which adds a new point at the location with the highest entropy. Here, the entropy is defined as the standard deviation of the GP model. For reliability analysis, the GP is trained to accurately approximate the underlying model only near the regions that satisfy the reliability conditions, where the maximum expected feasibility function is used to find the adaptation points. More details about the expected feasibility function can be found in Reliability Analysis — Efficient Global Reliability Analysis.
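Schematically, each adaptation step adds the point
\[
\mathbf{x}_{\mathrm{new}} = \arg\max_{\mathbf{x}} \, \sigma_{\mathrm{GP}}(\mathbf{x})
\]
for sensitivity analysis and uncertainty propagation, or
\[
\mathbf{x}_{\mathrm{new}} = \arg\max_{\mathbf{x}} \, \mathrm{EFF}(\mathbf{x})
\]
for reliability analysis, where σGP(x) denotes the standard deviation of the GP prediction at x and EFF(x) the expected feasibility function.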
To explore the input parameter space and exploit the most recently built GP, a global optimization problem is solved in every adaptation step to find the location corresponding to the maximum entropy or the maximum expected feasibility function. Two types of global optimization methods are available: the dividing rectangles (DIRECT) method (see Ref. 10) and the Monte Carlo method. The default choice is the DIRECT method. For nonadaptive GPs, this global optimization procedure is used to determine the maximum entropy as an error estimate of the model. Note that nonadaptive GP models are not used for reliability analysis.
Note that adaptive GP models require enough initial model evaluations for the adaptation method to locate “good” adaptation points that can improve the accuracy of the GP model. If the initial dataset has too few sample points, unexplored regions of the input parameter space may not be properly searched. One way to fix this problem is to use the improve and analyze functionality to append more samples using sequential optimal LHS, where the optimal LHS samples randomly fill the entire input parameter space.
Also note that for an adaptive GP built on input parameters with long tails (near-zero lower CDF and near-one upper CDF), most initial sample points will be located in the region where the input parameters have high probability. The accuracy of the GP model in the low-probability region will be poor due to the lack of sample points, so most adaptation points will probably be placed in that region. In this circumstance, a better way to create a GP model with high accuracy over the entire input parameter space is to use a uniform distribution between the minimum and maximum values corresponding to the lower and upper CDF.