Metrics in QPyTorch

In this notebook, we will see how to evaluate QPyTorch models with probabilistic metrics.

Note: It is encouraged to check the Simple QP Regression notebook first if not done already. We’ll reuse most of the code from there.

We’ll be modeling the function

\[\begin{split}\begin{align} y &= \sin(2\pi x) + \epsilon \\ \epsilon &\sim \mathcal{N}(0, 0.04) \end{align}\end{split}\]
[1]:
import math
import torch
import qpytorch
from matplotlib import pyplot as plt

%matplotlib inline
%load_ext autoreload
%autoreload 2

In the next cell, we set up the train and test data.

[2]:
# Training data is 100 points in [0,1] inclusive regularly spaced
train_x = torch.linspace(0, 1, 100)
# True function is sin(2*pi*x) with Gaussian noise
train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * math.sqrt(0.04)

test_x = torch.linspace(0, 1, 51)
test_y = torch.sin(test_x * (2 * math.pi)) + torch.randn(test_x.size()) * math.sqrt(0.04)

In the next cell, we define a simple QP regression model.

[3]:
# We will use the simplest form of QP model, exact inference
POWER = 1.0
class ExactQEPModel(qpytorch.models.ExactQEP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactQEPModel, self).__init__(train_x, train_y, likelihood)
        self.power = torch.tensor(POWER)
        self.mean_module = qpytorch.means.ConstantMean()
        self.covar_module = qpytorch.kernels.ScaleKernel(qpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return qpytorch.distributions.MultivariateQExponential(mean_x, covar_x, power=self.power)

# initialize likelihood and model
likelihood = qpytorch.likelihoods.QExponentialLikelihood(power=torch.tensor(POWER))
model = ExactQEPModel(train_x, train_y, likelihood)

Our model is ready for hyperparameter learning, but, first let us check how it performs on the test data.

[4]:
model.eval()
with torch.no_grad():
    untrained_pred_dist = likelihood(model(test_x))
    predictive_mean = untrained_pred_dist.mean
    lower, upper = untrained_pred_dist.confidence_region()

f, ax = plt.subplots(1, 1, figsize=(4, 3))
# Plot training data as black stars
ax.plot(train_x, train_y, 'k*')
# Plot predictive means as blue line
ax.plot(test_x, predictive_mean, 'b')
# Shade between the lower and upper confidence bounds
ax.fill_between(test_x, lower, upper, alpha=0.5)
ax.set_ylim([-3, 3])
ax.legend(['Observed Data', 'Mean', 'Confidence'], bbox_to_anchor=(1.6,1));
../../_images/examples_00_Basic_Usage_Metrics_7_0.png

Visually, this does not look like a good fit. Now, let us train the model hyperparameters.

[5]:
# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
training_iter = 2 if smoke_test else 80


# Find optimal model hyperparameters
model.train()

# Use the adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  # Includes QExponentialLikelihood parameters

# "Loss" for QEPs - the marginal log likelihood
mll = qpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iter):
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
        i + 1, training_iter, loss.item(),
        model.covar_module.base_kernel.lengthscale.item(),
        model.likelihood.noise.item()
    ))
    optimizer.step()
Iter 1/80 - Loss: 1.700   lengthscale: 0.693   noise: 0.693
Iter 2/80 - Loss: 1.670   lengthscale: 0.644   noise: 0.644
Iter 3/80 - Loss: 1.636   lengthscale: 0.598   noise: 0.598
Iter 4/80 - Loss: 1.597   lengthscale: 0.555   noise: 0.554
Iter 5/80 - Loss: 1.551   lengthscale: 0.514   noise: 0.513
Iter 6/80 - Loss: 1.496   lengthscale: 0.475   noise: 0.474
Iter 7/80 - Loss: 1.434   lengthscale: 0.439   noise: 0.437
Iter 8/80 - Loss: 1.368   lengthscale: 0.405   noise: 0.402
Iter 9/80 - Loss: 1.304   lengthscale: 0.372   noise: 0.369
Iter 10/80 - Loss: 1.249   lengthscale: 0.342   noise: 0.339
Iter 11/80 - Loss: 1.204   lengthscale: 0.314   noise: 0.310
Iter 12/80 - Loss: 1.170   lengthscale: 0.289   noise: 0.284
Iter 13/80 - Loss: 1.143   lengthscale: 0.267   noise: 0.260
Iter 14/80 - Loss: 1.121   lengthscale: 0.248   noise: 0.237
Iter 15/80 - Loss: 1.101   lengthscale: 0.232   noise: 0.217
Iter 16/80 - Loss: 1.082   lengthscale: 0.219   noise: 0.198
Iter 17/80 - Loss: 1.065   lengthscale: 0.208   noise: 0.181
Iter 18/80 - Loss: 1.048   lengthscale: 0.199   noise: 0.165
Iter 19/80 - Loss: 1.032   lengthscale: 0.192   noise: 0.151
Iter 20/80 - Loss: 1.015   lengthscale: 0.186   noise: 0.138
Iter 21/80 - Loss: 0.998   lengthscale: 0.182   noise: 0.126
Iter 22/80 - Loss: 0.981   lengthscale: 0.180   noise: 0.115
Iter 23/80 - Loss: 0.963   lengthscale: 0.178   noise: 0.105
Iter 24/80 - Loss: 0.945   lengthscale: 0.177   noise: 0.096
Iter 25/80 - Loss: 0.926   lengthscale: 0.178   noise: 0.087
Iter 26/80 - Loss: 0.907   lengthscale: 0.179   noise: 0.080
Iter 27/80 - Loss: 0.888   lengthscale: 0.181   noise: 0.073
Iter 28/80 - Loss: 0.867   lengthscale: 0.184   noise: 0.066
Iter 29/80 - Loss: 0.847   lengthscale: 0.188   noise: 0.060
Iter 30/80 - Loss: 0.826   lengthscale: 0.193   noise: 0.055
Iter 31/80 - Loss: 0.805   lengthscale: 0.198   noise: 0.050
Iter 32/80 - Loss: 0.783   lengthscale: 0.205   noise: 0.046
Iter 33/80 - Loss: 0.761   lengthscale: 0.212   noise: 0.042
Iter 34/80 - Loss: 0.739   lengthscale: 0.220   noise: 0.038
Iter 35/80 - Loss: 0.717   lengthscale: 0.228   noise: 0.035
Iter 36/80 - Loss: 0.694   lengthscale: 0.238   noise: 0.032
Iter 37/80 - Loss: 0.672   lengthscale: 0.248   noise: 0.029
Iter 38/80 - Loss: 0.650   lengthscale: 0.259   noise: 0.026
Iter 39/80 - Loss: 0.627   lengthscale: 0.270   noise: 0.024
Iter 40/80 - Loss: 0.606   lengthscale: 0.283   noise: 0.022
Iter 41/80 - Loss: 0.584   lengthscale: 0.295   noise: 0.020
Iter 42/80 - Loss: 0.563   lengthscale: 0.309   noise: 0.018
Iter 43/80 - Loss: 0.542   lengthscale: 0.322   noise: 0.017
Iter 44/80 - Loss: 0.521   lengthscale: 0.336   noise: 0.015
Iter 45/80 - Loss: 0.501   lengthscale: 0.351   noise: 0.014
Iter 46/80 - Loss: 0.481   lengthscale: 0.365   noise: 0.013
Iter 47/80 - Loss: 0.462   lengthscale: 0.380   noise: 0.012
Iter 48/80 - Loss: 0.444   lengthscale: 0.394   noise: 0.011
Iter 49/80 - Loss: 0.426   lengthscale: 0.408   noise: 0.010
Iter 50/80 - Loss: 0.409   lengthscale: 0.421   noise: 0.009
Iter 51/80 - Loss: 0.393   lengthscale: 0.433   noise: 0.008
Iter 52/80 - Loss: 0.377   lengthscale: 0.444   noise: 0.007
Iter 53/80 - Loss: 0.362   lengthscale: 0.453   noise: 0.007
Iter 54/80 - Loss: 0.346   lengthscale: 0.461   noise: 0.006
Iter 55/80 - Loss: 0.331   lengthscale: 0.467   noise: 0.006
Iter 56/80 - Loss: 0.316   lengthscale: 0.471   noise: 0.005
Iter 57/80 - Loss: 0.302   lengthscale: 0.473   noise: 0.005
Iter 58/80 - Loss: 0.288   lengthscale: 0.474   noise: 0.004
Iter 59/80 - Loss: 0.274   lengthscale: 0.474   noise: 0.004
Iter 60/80 - Loss: 0.260   lengthscale: 0.473   noise: 0.004
Iter 61/80 - Loss: 0.247   lengthscale: 0.471   noise: 0.003
Iter 62/80 - Loss: 0.235   lengthscale: 0.468   noise: 0.003
Iter 63/80 - Loss: 0.224   lengthscale: 0.466   noise: 0.003
Iter 64/80 - Loss: 0.213   lengthscale: 0.463   noise: 0.003
Iter 65/80 - Loss: 0.202   lengthscale: 0.461   noise: 0.002
Iter 66/80 - Loss: 0.192   lengthscale: 0.458   noise: 0.002
Iter 67/80 - Loss: 0.183   lengthscale: 0.457   noise: 0.002
Iter 68/80 - Loss: 0.175   lengthscale: 0.456   noise: 0.002
Iter 69/80 - Loss: 0.166   lengthscale: 0.455   noise: 0.002
Iter 70/80 - Loss: 0.159   lengthscale: 0.455   noise: 0.002
Iter 71/80 - Loss: 0.151   lengthscale: 0.455   noise: 0.002
Iter 72/80 - Loss: 0.145   lengthscale: 0.456   noise: 0.001
Iter 73/80 - Loss: 0.138   lengthscale: 0.458   noise: 0.001
Iter 74/80 - Loss: 0.132   lengthscale: 0.460   noise: 0.001
Iter 75/80 - Loss: 0.127   lengthscale: 0.462   noise: 0.001
Iter 76/80 - Loss: 0.122   lengthscale: 0.464   noise: 0.001
Iter 77/80 - Loss: 0.117   lengthscale: 0.467   noise: 0.001
Iter 78/80 - Loss: 0.113   lengthscale: 0.469   noise: 0.001
Iter 79/80 - Loss: 0.109   lengthscale: 0.472   noise: 0.001
Iter 80/80 - Loss: 0.106   lengthscale: 0.474   noise: 0.001

In the next cell, we reevaluate the model on the test data.

[6]:
model.eval()
with torch.no_grad():
    trained_pred_dist = likelihood(model(test_x))
    predictive_mean = trained_pred_dist.mean
    lower, upper = trained_pred_dist.confidence_region(rescale=True)

f, ax = plt.subplots(1, 1, figsize=(4, 3))
# Plot training data as black stars
ax.plot(train_x, train_y, 'k*')
# Plot predictive means as blue line
ax.plot(test_x, predictive_mean, 'b')
# Shade between the lower and upper confidence bounds
ax.fill_between(test_x, lower, upper, alpha=0.5)
ax.set_ylim([-3, 3])
ax.legend(['Observed Data', 'Mean', 'Confidence'], bbox_to_anchor=(1.6,1));
../../_images/examples_00_Basic_Usage_Metrics_11_0.png

Now our model seems to fit well on the data. It is not always possible to visually evaluate the model in high dimensional cases. Thus, now we evaluate the models with help of probabilistic metrics. We have saved predictive distributions from untrained and trained models as untrained_pred_dist and trained_pred_dist respectively.

Negative Log Predictive Density (NLPD)

Negative Log Predictive Density (NLPD) is the most standard probabilistic metric for evaluating GP models. In simple terms, it is negative log likelihood of the test data given the predictive distribution. It can be computed as follows:

[7]:
init_nlpd = qpytorch.metrics.negative_log_predictive_density(untrained_pred_dist, test_y)
final_nlpd = qpytorch.metrics.negative_log_predictive_density(trained_pred_dist, test_y)

print(f'Untrained model NLPD: {init_nlpd:.2f}, \nTrained model NLPD: {final_nlpd:.2f}')
Untrained model NLPD: 1.45,
Trained model NLPD: -0.29

Mean Standardized Log Loss (MSLL)

This metric computes average negative log likelihood of all test points w.r.t their univariate predicitve densities. For more details, see “page No. 23, Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Christopher K. I. Williams, The MIT Press, 2006. ISBN 0-262-18253-X”

[8]:
init_msll = qpytorch.metrics.mean_standardized_log_loss(untrained_pred_dist, test_y)
final_msll = qpytorch.metrics.mean_standardized_log_loss(trained_pred_dist, test_y)

print(f'Untrained model MSLL: {init_msll:.2f}, \nTrained model MSLL: {final_msll:.2f}')
Untrained model MSLL: 0.88,
Trained model MSLL: 13.94

It is also possible to calculate the quantile coverage error with qpytorch.metrics.quantile_coverage_error function.

[9]:
quantile = 95
qce = qpytorch.metrics.quantile_coverage_error(trained_pred_dist, test_y, quantile=quantile)
print(f'Quantile {quantile}% Coverage Error: {qce:.2f}')
Quantile 95% Coverage Error: 0.56

Mean Squared Error (MSE)

Mean Squared Error (MSE) is the mean of the squared difference between the test observations and the predictive mean. It is a well-known conventional metric for evaluating regression models. However, it can not take uncertainty into account unlike NLPD, MLSS and ACE.

[10]:
init_mse = qpytorch.metrics.mean_squared_error(untrained_pred_dist, test_y, squared=True)
final_mse = qpytorch.metrics.mean_squared_error(trained_pred_dist, test_y, squared=True)

print(f'Untrained model MSE: {init_mse:.2f}, \nTrained model MSE: {final_mse:.2f}')
Untrained model MSE: 0.19,
Trained model MSE: 0.03

Mean Absolute Error (MAE)

Mean Absolute Error (MAE) is the mean of the absolute difference between the test observations and the predictive mean. It is less sensitive to the outliers than MSE.

[11]:
init_mae = qpytorch.metrics.mean_absolute_error(untrained_pred_dist, test_y)
final_mae = qpytorch.metrics.mean_absolute_error(trained_pred_dist, test_y)

print(f'Untrained model MAE: {init_mae:.2f}, \nTrained model MAE: {final_mae:.2f}')
Untrained model MAE: 0.37,
Trained model MAE: 0.13