{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using Natural Gradient Descent with Variational Models\n",
    "\n",
    "## Overview\n",
    "\n",
    "In this notebook, we'll demonstrate how to use **natural gradient descent** when optimizing variational QPyTorch models. This will be the same as the [SVQEP regression notebook](./SVQEP_Regression_CUDA.ipynb), except we will be using a different optimizer.\n",
    "\n",
    "## What is natural gradient descent (NGD)?\n",
    "\n",
    "Without going into too much detail, using SGD or Adam isn't the best way to optimize the parameters of variational q-exponential distributions. Essentially, SGD takes steps assuming that the loss geometry of the parameters is Euclidean. This is a bad assumption for the parameters of many distributions, especially Gaussians. See [Agustinus Kristiadi's blog post](https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/) for more detail.\n",
    "\n",
    "Instead, it makes more sense to take gradient steps that have been scaled better for the geometry of the variational distribution's parameters. Specifically, if $\\mathbf m$ and $\\mathbf S$ are the mean and covariance of our variational q-exponential posterior approximation, then we will achieve faster convergence if we take the following steps:\n",
    "\n",
    "\\begin{align}\n",
    "\\begin{bmatrix} \\mathbf m \\\\ \\mathbf S \\end{bmatrix}\n",
    "\\leftarrow\n",
    "\\begin{bmatrix} \\mathbf m \\\\ \\mathbf S \\end{bmatrix}\n",
    "-\n",
    "\\alpha \\mathcal{\\mathbf F}^{-1} \\nabla \\begin{bmatrix} \\mathbf m \\\\   \\mathbf S \\end{bmatrix}\n",
    "\\end{align}\n",
    "\n",
    "where $\\alpha$ is a step size and $\\mathbf F$ is the **Fisher information matrix** corresponding to this distribution. This is known as **natural gradient descent**, or **NGD**.\n",
    "\n",
    "It turns out that for Gaussian distributions (and, more broadly, for all distributions in the exponential family), there are efficient update equations for NGD. See the following papers for more information:\n",
    "- [Salimbeni, Hugh, Stefanos Eleftheriadis, and James Hensman. \"Natural gradients in practice: Non-conjugate variational inference in gaussian process models.\" AISTATS (2018).](https://arxiv.org/abs/1803.09151)\n",
    "- [Hensman, James, Magnus Rattray, and Neil D. Lawrence. \"Fast variational inference in the conjugate exponential family.\" NeurIPS (2012).](http://papers.nips.cc/paper/4766-fast-variational-inference-in-the-conjugate-exponential-family)\n",
    "\n",
    "## Jointly optimizing variational parameters/hyperparameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tqdm\n",
    "import math\n",
    "import torch\n",
    "import qpytorch\n",
    "from matplotlib import pyplot as plt\n",
    "\n",
    "# Make plots inline\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this example notebook, we'll be using the `elevators` UCI dataset used in the paper. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately. For this notebook, we'll simply be splitting the data using the first 80% of the data as training and the last 20% as testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import urllib.request\n",
    "import os\n",
    "from scipy.io import loadmat\n",
    "from math import floor\n",
    "\n",
    "\n",
    "# this is for running the notebook in our testing framework\n",
    "smoke_test = ('CI' in os.environ)\n",
    "\n",
    "\n",
    "if not smoke_test and not os.path.isfile('../elevators.mat'):\n",
    "    print('Downloading \\'elevators\\' UCI dataset...')\n",
    "    urllib.request.urlretrieve('https://drive.google.com/uc?export=download&id=1jhWL3YUHvXIaftia4qeAyDwVxo6j1alk', '../elevators.mat')\n",
    "\n",
    "\n",
    "if smoke_test:  # this is for running the notebook in our testing framework\n",
    "    X, y = torch.randn(1000, 3), torch.randn(1000)\n",
    "else:\n",
    "    data = torch.Tensor(loadmat('../elevators.mat')['data'])\n",
    "    X = data[:, :-1]\n",
    "    X = X - X.min(0)[0]\n",
    "    X = 2 * (X / X.max(0)[0]) - 1\n",
    "    y = data[:, -1]\n",
    "\n",
    "\n",
    "train_n = int(floor(0.8 * len(X)))\n",
    "train_x = X[:train_n, :].contiguous()\n",
    "train_y = y[:train_n].contiguous()\n",
    "\n",
    "test_x = X[train_n:, :].contiguous()\n",
    "test_y = y[train_n:].contiguous()\n",
    "\n",
    "if torch.cuda.is_available():\n",
    "    train_x, train_y, test_x, test_y = train_x.cuda(), train_y.cuda(), test_x.cuda(), test_y.cuda()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following steps create the dataloader objects. See the [SVQEP regression notebook](./SVQEP_Regression_CUDA.ipynb) for details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch.utils.data import TensorDataset, DataLoader\n",
    "train_dataset = TensorDataset(train_x, train_y)\n",
    "train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)\n",
    "\n",
    "test_dataset = TensorDataset(test_x, test_y)\n",
    "test_loader = DataLoader(test_dataset, batch_size=1024, shuffle=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SVQEP models for NGD\n",
    "\n",
    "There are **three** key differences between NGD-SVQEP models and the standard SVQEP models from the [SVQEP regression notebook](./SVQEP_Regression_CUDA.ipynb).\n",
    "\n",
    "\n",
    "### Difference #1: NaturalVariationalDistribution\n",
    "\n",
    "Rather than using `qpytorch.variational.CholeskyVarationalDistribution` (or other variational distribution objects), you have to use one of the two objects:\n",
    "\n",
    "- `qpytorch.variational.NaturalVariationalDistribution` (typically faster optimization convergence, but less stable for non-conjugate likelihoods)\n",
    "- `qpytorch.variational.TrilNaturalVariationalDistribution` (typically slower optimization convergence, but more stable for non-conjugate likelihoods)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "POWER = 1.0\n",
    "class QEPModel(qpytorch.models.ApproximateQEP):\n",
    "    def __init__(self, inducing_points):\n",
    "        self.power = torch.tensor(POWER)\n",
    "        variational_distribution = qpytorch.variational.NaturalVariationalDistribution(inducing_points.size(0), power=self.power)\n",
    "        variational_strategy = qpytorch.variational.VariationalStrategy(\n",
    "            self, inducing_points, variational_distribution, learn_inducing_locations=True\n",
    "        )\n",
    "        super(QEPModel, self).__init__(variational_strategy)\n",
    "        self.mean_module = qpytorch.means.ConstantMean()\n",
    "        self.covar_module = qpytorch.kernels.ScaleKernel(qpytorch.kernels.RBFKernel())\n",
    "        \n",
    "    def forward(self, x):\n",
    "        mean_x = self.mean_module(x)\n",
    "        covar_x = self.covar_module(x)\n",
    "        return qpytorch.distributions.MultivariateQExponential(mean_x, covar_x, power=self.power)\n",
    "\n",
    "\n",
    "inducing_points = train_x[:500, :]\n",
    "model = QEPModel(inducing_points=inducing_points)\n",
    "likelihood = qpytorch.likelihoods.QExponentialLikelihood(power=model.power)\n",
    "\n",
    "if torch.cuda.is_available():\n",
    "    model = model.cuda()\n",
    "    likelihood = likelihood.cuda()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Difference #2: Two optimizers - one for the variational parameters; one for the hyperparameters\n",
    "\n",
    "NGD steps only update the variational parameters. Therefore, we need two separate optimizers: one for the variational parameters (using NGD) and one for the other hyperparameters (using Adam or whatever you want).\n",
    "\n",
    "Some things to note about the NGD variational optimizer:\n",
    "\n",
    "- **You must use `qpytorch.optim.NGD` as the variational NGD optimizer!** Adaptive gradient algorithms will mess up the natural gradient steps. (Any stochastic optimizer works for the hyperparameters.)\n",
    "- **Use a large learning rate for the variational optimizer.** Typically, 0.1 is a good learning rate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "variational_ngd_optimizer = qpytorch.optim.NGD(model.variational_parameters(), num_data=train_y.size(0), lr=0.1)\n",
    "\n",
    "hyperparameter_optimizer = torch.optim.Adam([\n",
    "    {'params': model.hyperparameters()},\n",
    "    {'params': likelihood.parameters()},\n",
    "], lr=0.01)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Difference #3: The updated training loop\n",
    "\n",
    "In the training loop, we have to update both optimizers (`variational_ngd_optimizer` and `hyperparameter_optimizer`). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2b39b0113ede442481536d66307b5db0",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Epoch:   0%|          | 0/4 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "1f5bef0731c24936a53bf7e6c98f5efa",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Minibatch:   0%|          | 0/13 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "eda3401fb102413e89a1d0ceedb02dde",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Minibatch:   0%|          | 0/13 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "1c2cb7516e25475b9f754864e43b9c15",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Minibatch:   0%|          | 0/13 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e36b089f71614c49811706f28653b616",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Minibatch:   0%|          | 0/13 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "model.train()\n",
    "likelihood.train()\n",
    "mll = qpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0))\n",
    "\n",
    "num_epochs = 1 if smoke_test else 4\n",
    "epochs_iter = tqdm.notebook.tqdm(range(num_epochs), desc=\"Epoch\")\n",
    "for i in epochs_iter:\n",
    "    minibatch_iter = tqdm.notebook.tqdm(train_loader, desc=\"Minibatch\", leave=False)\n",
    "    \n",
    "    for x_batch, y_batch in minibatch_iter:\n",
    "        ### Perform NGD step to optimize variational parameters\n",
    "        variational_ngd_optimizer.zero_grad()\n",
    "        hyperparameter_optimizer.zero_grad()\n",
    "        output = model(x_batch)\n",
    "        loss = -mll(output, y_batch)\n",
    "        minibatch_iter.set_postfix(loss=loss.item())\n",
    "        loss.backward()\n",
    "        variational_ngd_optimizer.step()\n",
    "        hyperparameter_optimizer.step()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You could also modify the optimization loop to alternate between the NGD step and the hyperparameter updates:\n",
    "\n",
    "```python\n",
    "    for x_batch, y_batch in minibatch_iter:\n",
    "        ### Perform NGD step to optimize variational parameters\n",
    "        variational_ngd_optimizer.zero_grad()\n",
    "        output = model(x_batch)\n",
    "        loss = -mll(output, y_batch)\n",
    "        minibatch_iter.set_postfix(loss=loss.item())\n",
    "        loss.backward()\n",
    "        variational_ngd_optimizer.step()\n",
    "        \n",
    "        ### Perform Adam step to optimize hyperparameters\n",
    "        hyperparameter_optimizer.zero_grad()\n",
    "        output = model(x_batch)\n",
    "        loss = -mll(output, y_batch)\n",
    "        loss.backward()\n",
    "        hyperparameter_optimizer.step()\n",
    "```\n",
    "\n",
    "### Evaluation\n",
    "\n",
    "That's it! This model should converge must faster/better than the standard SVQEP model - and it can often get better performance (especially when using more inducing points)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test MAE: 0.0765795111656189\n"
     ]
    }
   ],
   "source": [
    "model.eval()\n",
    "likelihood.eval()\n",
    "means = torch.tensor([0.])\n",
    "with torch.no_grad():\n",
    "    for x_batch, y_batch in test_loader:\n",
    "        preds = model(x_batch)\n",
    "        means = torch.cat([means, preds.mean.cpu()])\n",
    "means = means[1:]\n",
    "print('Test MAE: {}'.format(torch.mean(torch.abs(means - test_y.cpu()))))"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}