{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using Natural Gradient Descent with Variational Models\n", "\n", "## Overview\n", "\n", "In this notebook, we'll demonstrate how to use **natural gradient descent** when optimizing variational QPyTorch models. This will be the same as the [SVQEP regression notebook](./SVQEP_Regression_CUDA.ipynb), except we will be using a different optimizer.\n", "\n", "## What is natural gradient descent (NGD)?\n", "\n", "Without going into too much detail, using SGD or Adam isn't the best way to optimize the parameters of variational q-exponential distributions. Essentially, SGD takes steps assuming that the loss geometry of the parameters is Euclidean. This is a bad assumption for the parameters of many distributions, especially Gaussians. See [Agustinus Kristiadi's blog post](https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/) for more detail.\n", "\n", "Instead, it makes more sense to take gradient steps that have been scaled better for the geometry of the variational distribution's parameters. Specifically, if $\\mathbf m$ and $\\mathbf S$ are the mean and covariance of our variational q-exponential posterior approximation, then we will achieve faster convergence if we take the following steps:\n", "\n", "\\begin{align}\n", "\\begin{bmatrix} \\mathbf m \\\\ \\mathbf S \\end{bmatrix}\n", "\\leftarrow\n", "\\begin{bmatrix} \\mathbf m \\\\ \\mathbf S \\end{bmatrix}\n", "-\n", "\\alpha \\mathcal{\\mathbf F}^{-1} \\nabla \\begin{bmatrix} \\mathbf m \\\\ \\mathbf S \\end{bmatrix}\n", "\\end{align}\n", "\n", "where $\\alpha$ is a step size and $\\mathbf F$ is the **Fisher information matrix** corresponding to this distribution. This is known as **natural gradient descent**, or **NGD**.\n", "\n", "It turns out that for Gaussian distributions (and, more broadly, for all distributions in the exponential family), there are efficient update equations for NGD. See the following papers for more information:\n", "- [Salimbeni, Hugh, Stefanos Eleftheriadis, and James Hensman. \"Natural gradients in practice: Non-conjugate variational inference in gaussian process models.\" AISTATS (2018).](https://arxiv.org/abs/1803.09151)\n", "- [Hensman, James, Magnus Rattray, and Neil D. Lawrence. \"Fast variational inference in the conjugate exponential family.\" NeurIPS (2012).](http://papers.nips.cc/paper/4766-fast-variational-inference-in-the-conjugate-exponential-family)\n", "\n", "## Jointly optimizing variational parameters/hyperparameters" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import tqdm\n", "import math\n", "import torch\n", "import qpytorch\n", "from matplotlib import pyplot as plt\n", "\n", "# Make plots inline\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example notebook, we'll be using the `elevators` UCI dataset used in the paper. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately. For this notebook, we'll simply be splitting the data using the first 80% of the data as training and the last 20% as testing." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import urllib.request\n", "import os\n", "from scipy.io import loadmat\n", "from math import floor\n", "\n", "\n", "# this is for running the notebook in our testing framework\n", "smoke_test = ('CI' in os.environ)\n", "\n", "\n", "if not smoke_test and not os.path.isfile('../elevators.mat'):\n", " print('Downloading \\'elevators\\' UCI dataset...')\n", " urllib.request.urlretrieve('https://drive.google.com/uc?export=download&id=1jhWL3YUHvXIaftia4qeAyDwVxo6j1alk', '../elevators.mat')\n", "\n", "\n", "if smoke_test: # this is for running the notebook in our testing framework\n", " X, y = torch.randn(1000, 3), torch.randn(1000)\n", "else:\n", " data = torch.Tensor(loadmat('../elevators.mat')['data'])\n", " X = data[:, :-1]\n", " X = X - X.min(0)[0]\n", " X = 2 * (X / X.max(0)[0]) - 1\n", " y = data[:, -1]\n", "\n", "\n", "train_n = int(floor(0.8 * len(X)))\n", "train_x = X[:train_n, :].contiguous()\n", "train_y = y[:train_n].contiguous()\n", "\n", "test_x = X[train_n:, :].contiguous()\n", "test_y = y[train_n:].contiguous()\n", "\n", "if torch.cuda.is_available():\n", " train_x, train_y, test_x, test_y = train_x.cuda(), train_y.cuda(), test_x.cuda(), test_y.cuda()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following steps create the dataloader objects. See the [SVQEP regression notebook](./SVQEP_Regression_CUDA.ipynb) for details." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from torch.utils.data import TensorDataset, DataLoader\n", "train_dataset = TensorDataset(train_x, train_y)\n", "train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)\n", "\n", "test_dataset = TensorDataset(test_x, test_y)\n", "test_loader = DataLoader(test_dataset, batch_size=1024, shuffle=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SVQEP models for NGD\n", "\n", "There are **three** key differences between NGD-SVQEP models and the standard SVQEP models from the [SVQEP regression notebook](./SVQEP_Regression_CUDA.ipynb).\n", "\n", "\n", "### Difference #1: NaturalVariationalDistribution\n", "\n", "Rather than using `qpytorch.variational.CholeskyVarationalDistribution` (or other variational distribution objects), you have to use one of the two objects:\n", "\n", "- `qpytorch.variational.NaturalVariationalDistribution` (typically faster optimization convergence, but less stable for non-conjugate likelihoods)\n", "- `qpytorch.variational.TrilNaturalVariationalDistribution` (typically slower optimization convergence, but more stable for non-conjugate likelihoods)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "POWER = 1.0\n", "class QEPModel(qpytorch.models.ApproximateQEP):\n", " def __init__(self, inducing_points):\n", " self.power = torch.tensor(POWER)\n", " variational_distribution = qpytorch.variational.NaturalVariationalDistribution(inducing_points.size(0), power=self.power)\n", " variational_strategy = qpytorch.variational.VariationalStrategy(\n", " self, inducing_points, variational_distribution, learn_inducing_locations=True\n", " )\n", " super(QEPModel, self).__init__(variational_strategy)\n", " self.mean_module = qpytorch.means.ConstantMean()\n", " self.covar_module = qpytorch.kernels.ScaleKernel(qpytorch.kernels.RBFKernel())\n", " \n", " def forward(self, x):\n", " mean_x = self.mean_module(x)\n", " covar_x = self.covar_module(x)\n", " return qpytorch.distributions.MultivariateQExponential(mean_x, covar_x, power=self.power)\n", "\n", "\n", "inducing_points = train_x[:500, :]\n", "model = QEPModel(inducing_points=inducing_points)\n", "likelihood = qpytorch.likelihoods.QExponentialLikelihood(power=model.power)\n", "\n", "if torch.cuda.is_available():\n", " model = model.cuda()\n", " likelihood = likelihood.cuda()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Difference #2: Two optimizers - one for the variational parameters; one for the hyperparameters\n", "\n", "NGD steps only update the variational parameters. Therefore, we need two separate optimizers: one for the variational parameters (using NGD) and one for the other hyperparameters (using Adam or whatever you want).\n", "\n", "Some things to note about the NGD variational optimizer:\n", "\n", "- **You must use `qpytorch.optim.NGD` as the variational NGD optimizer!** Adaptive gradient algorithms will mess up the natural gradient steps. (Any stochastic optimizer works for the hyperparameters.)\n", "- **Use a large learning rate for the variational optimizer.** Typically, 0.1 is a good learning rate." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "variational_ngd_optimizer = qpytorch.optim.NGD(model.variational_parameters(), num_data=train_y.size(0), lr=0.1)\n", "\n", "hyperparameter_optimizer = torch.optim.Adam([\n", " {'params': model.hyperparameters()},\n", " {'params': likelihood.parameters()},\n", "], lr=0.01)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Difference #3: The updated training loop\n", "\n", "In the training loop, we have to update both optimizers (`variational_ngd_optimizer` and `hyperparameter_optimizer`). " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2b39b0113ede442481536d66307b5db0", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Epoch: 0%| | 0/4 [00:00