Gaussian Processes

Code and explanation mostly from Nando de Freitas, http://www.cs.ubc.ca/~nando/540-2013/lectures.html

In [29]:
import numpy as np
import matplotlib.pyplot as pl
%matplotlib inline

Assumption

  • The data is smooth, i.e. if inputs $\vec x$ are similar, then the corresponding $y$-values are similar.

"Smoothness Prior"

Noiseless GP Regression

$$ \mathcal D_{train} = \{ (\vec x^{(1)}, f^{(1)}), \dots, (\vec x^{(m)}, f^{(m)})\} $$

with

  • $n$-dimensional vectors $\vec x^{(j)}$
  • $f^{(j)} = f(\vec x^{(j)})$, i.e. the (noise-free) values of a function $f$ evaluated at the training inputs

Goal:

Given a test set $X_*$ of size $m_* \times n$ we want to predict the (function) outputs $\vec f_*$. Jointly with the training outputs $\vec f$ they are Gaussian distributed:

$$ \left(\begin{array}{c} \vec f \\ \vec f_* \end{array} \right) \sim \mathcal N \left(\left( \begin{array}{c} \vec \mu \\ \vec \mu_* \end{array} \right) , \left(\begin{array}{cc} K & K_* \\ K_*^T & K_{**} \end{array} \right) \right) $$

$$ \vec f^T = (f^{(1)}, f^{(2)}, \dots, f^{(m)}) $$ and analogously $\vec f_*^T$, $\vec \mu$ and $\vec \mu_*$

Kernel

with

  • $K = \kappa(X, X)$ is $m \times m$
  • $K_* = \kappa(X, X_*)$ is $m \times m_*$
  • $K_{**} = \kappa(X_*, X_*)$ is $m_* \times m_*$

and the kernel $\kappa$, e.g.
$$ \kappa (x, x') = \sigma_f^2 \exp \left( - \frac{(x-x')^2}{2 l^2}\right) $$

  • $l$: length scale (a hyperparameter)
  • $\sigma_f^2$: signal variance (also a hyperparameter); it scales the overall magnitude of the kernel
In [2]:
def kernel(a, b, kernelParameter = 0.1, sigma_square = 1.):
    """ GP squared exponential kernel 
    
    \kappa (x, x') = \sigma_f^2 \exp \left( - \frac{(x-x')^2}{2 l^2}\right) 
    Parameters
    ----------
    a : numpy array (shape: m x n)
        Design matrix:  row index == data index; column index == feature index
    
    b : numpy array (shape: m_* x n)
        Design matrix:  row index == data index; column index == feature index    
        
    kernelParameter: float
        kernelParameter = l^2
    sigma_square: float
        \sigma_f^2
        
    """
    # efficient implementation in numpy:
    # squared Euclidean distance between all row vectors of a and b.
    sqdist = np.sum(a**2,1).reshape(-1,1) + np.sum(b**2,1) - 2*np.dot(a, b.T)
    
    return sigma_square * np.exp(-.5 * (1/kernelParameter) * sqdist)
In [3]:
X = np.array([[1,2,3],[2,2,3],[3,2,3]])
X_ = np.array([[4,2,3],[2,2,4]])
kernel(X, X_, 2)
Out[3]:
array([[ 0.10539922,  0.60653066],
       [ 0.36787944,  0.77880078],
       [ 0.77880078,  0.60653066]])

Multivariate Gaussian Theorem

Let $\vec x_1$ and $\vec x_2$ be jointly Gaussian random vectors:

$$ \left(\begin{array}{c} \vec x_1 \\ \vec x_2 \end{array}\right) \sim \mathcal N \left( \left(\begin{array}{c} \vec \mu_1 \\ \vec \mu_2 \end{array}\right), \left(\begin{array}{cc} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{array}\right) \right) $$

Then the conditional distribution of $\vec x_2$ given $\vec x_1$ is again Gaussian:

$$ p(\vec x_2 \mid \vec x_1) = \mathcal N \left( \vec x_2 \mid \vec \mu_{2 \mid 1}, \Sigma_{2 \mid 1} \right) $$

$$ \vec \mu_{2 \mid 1} = \vec \mu_2 + \Sigma_{21} \Sigma_{11}^{-1}(\vec x_1 - \vec \mu_1) $$

$$ \Sigma_{2 \mid 1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} $$

Applied to the noiseless GP setting, the predictive distribution of the test outputs is

$$ p(\vec f_* \mid X_*, X, \vec f) = \mathcal N(\vec f_* \mid \vec \mu_*, \Sigma_*) $$

with (by the Multivariate Gaussian Theorem, see K.Murphy ML, Theorem 4.2.1): $$ \vec \mu_* = \vec \mu (X_*) + K_*^T K^{-1} ( \vec f - \vec \mu (X) ) $$ $$ \Sigma_{*} = K_{**} - K_*^T K^{-1} K_* $$

What are $\vec \mu (X_*)$ and $\vec \mu (X)$? They are the prior mean functions evaluated at the test and training inputs (often simply taken to be zero).
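
As a minimal illustration (a sketch, not from the original notebook): assuming a zero-mean prior and reusing the kernel function defined above, the noiseless posterior mean and covariance can be computed directly from these formulas.

In [ ]:
# Sketch: noiseless GP prediction with a zero-mean prior,
# reusing the kernel() function defined above.
Xtr = np.array([[-4.], [-2.], [0.], [2.]])    # training inputs, shape m x n
ftr = np.sin(Xtr).flatten()                   # noise-free training outputs f
Xte = np.linspace(-5, 5, 50).reshape(-1, 1)   # test inputs, shape m_* x n

K    = kernel(Xtr, Xtr, kernelParameter=2.)   # m   x m
K_s  = kernel(Xtr, Xte, kernelParameter=2.)   # m   x m_*
K_ss = kernel(Xte, Xte, kernelParameter=2.)   # m_* x m_*

K_inv    = np.linalg.inv(K)                   # naive explicit inverse (a stable version follows below)
mu_star  = K_s.T.dot(K_inv).dot(ftr)          # posterior mean:       K_*^T K^{-1} f
Sigma_st = K_ss - K_s.T.dot(K_inv).dot(K_s)   # posterior covariance: K_** - K_*^T K^{-1} K_*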

Noisy GP Regression

$$ y = f(\vec x) + \epsilon $$ with

  • $\epsilon \sim \mathcal N(0, \sigma_y^2)$, i.e. i.i.d. Gaussian noise

$$ p(\vec y \mid X) = \int p(\vec y \mid \vec f, X) \, p(\vec f \mid X) \, d\vec f $$

with zero-mean prior: $$ p(\vec f \mid X) = \mathcal N (\vec f \mid \vec 0, K) $$

and independent noise for each data point: $$ p(\vec y \mid \vec f) = \prod_{j=1}^m \mathcal N(y^{(j)} \mid f^{(j)}, \sigma_y^2) $$

with the multivariate Gaussian theorem:

$$ \text{cov} \left[ \vec y \mid X \right] = K + \sigma_y^2 {\bf I}_m =: K_y $$ so we just need to modify the kernel matrix: $K \rightarrow K_y$

$$ \left( \begin{array}{c} \vec y \\ \vec f_* \end{array} \right) \sim \mathcal N \left( \left( \begin{array}{c} \vec \mu \\ \vec \mu_* \end{array} \right) , \left( \begin{array}{cc} K_y & K_* \\ K_*^T & K_{**} \end{array} \right) \right) $$

$$ p(\vec f_* \mid X_*, X, \vec y) = \mathcal N(\vec f_* \mid \vec \mu_*, \Sigma_*) $$

$$ \vec \mu_* = \vec \mu (X_*) + K_*^T K_y^{-1} ( \vec y - \vec \mu (X) ) $$ $$ \Sigma_{*} = K_{**} - K_*^T K_y^{-1} K_* $$

A prior for the noise variance could be an inverse Wishart distribution (in 1-D this is the inverse Gamma distribution), see

https://en.wikipedia.org/wiki/Inverse-gamma_distribution
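
For illustration only (a sketch, not from the original notebook): the inverse-gamma density is available in scipy.stats, so one can plot a few candidate priors for $\sigma_y^2$.

In [ ]:
# Sketch: inverse-gamma densities as possible priors for the noise variance sigma_y^2
import scipy.stats

xs = np.linspace(1e-3, 3, 200)
for a in (1., 2., 3.):
    pl.plot(xs, scipy.stats.invgamma.pdf(xs, a, scale=1.), label="shape a = %g" % a)
pl.legend()
pl.title('Inverse-gamma densities')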

In [4]:
# This is the true unknown function we are trying to approximate
f = lambda x: np.sin(0.9*x).flatten()
#f = lambda x: (0.25*(x**2)).flatten()
In [5]:
N = 5        # number of training points.
n = 100         # number of test points.
s = 0.00005    # noise level (standard deviation of the Gaussian noise added below).


# Sample some input points and noisy versions of the function evaluated at
# these points. 
X = np.random.uniform(-5, 5, size=(N,1))
y = f(X) + s*np.random.randn(N)
In [6]:
# this is also a (hyper)parameter that should be estimated from the data,
# e.g. by maximizing the marginal likelihood (see the TODO at the end).
s_y = 0.001
In [7]:
var_y = np.eye(len(X)) * s_y
In [8]:
# points we're going to make predictions at.
Xtest = np.linspace(-5, 5, n).reshape(-1,1)
In [9]:
kernelParameter = 2.
In [10]:
K = kernel(X, X, kernelParameter = kernelParameter)
K_y = K + var_y
k_ = kernel(X, Xtest, kernelParameter = kernelParameter)

# naive numerically unstable version
K_inv = np.linalg.inv(K_y)
alpha = np.dot(K_inv, y)
mu = np.dot(k_.T, alpha)
In [11]:
K_ = kernel(Xtest, Xtest, kernelParameter = kernelParameter)
s_2 = K_ - np.dot( k_.T, np.dot(K_inv, k_) )     
s_ = np.sqrt(np.diag(s_2))
In [12]:
pl.figure(1)
pl.clf()
pl.plot(X, y, 'r+', ms=20)
pl.plot(Xtest, f(Xtest), 'b-')
pl.gca().fill_between(Xtest.flat, mu-3*s_, mu+3*s_, color="#dddddd")
pl.plot(Xtest, mu, 'r--', lw=2)
pl.savefig('predictive.png', bbox_inches='tight')
pl.title('Mean predictions plus 3 st.deviations')
pl.axis([-5, 5, -3, 3])
Out[12]:
[-5, 5, -3, 3]

Faster and numerically more stable implementation

The implementation above explicitly inverts $K_y$. The matrix $K_y$ is symmetric and positive definite.

For such a matrix the required linear systems can be solved more stably and faster by using the Cholesky decomposition.

A symmetric (more generally, Hermitian) positive definite matrix $K$ can be factorized as $$ K = L L^T $$ with

  • $L$ is a lower triangular matrix with real and positive diagonal entries
  • and $L^T$ is the transpose of $L$.
In [13]:
# same code snippet as above
K = kernel(X, X, kernelParameter = kernelParameter)
K_y = K + var_y
In [14]:
L = np.linalg.cholesky(K_y)
L
Out[14]:
array([[ 1.00049988,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.09580453,  0.99590235,  0.        ,  0.        ,  0.        ],
       [ 0.13339719,  0.97872192,  0.15908673,  0.        ,  0.        ],
       [ 0.20207284,  0.91560838,  0.33473585,  0.09889277,  0.        ],
       [ 0.14997737, -0.01421492, -0.03578134, -0.044307  ,  0.98745193]])
In [15]:
np.allclose(L, np.tril(L)) # check if lower triangular
Out[15]:
True

It is much easier to solve linear systems with a triangular matrix (by forward or back substitution), and numerically robust routines exist for this; see http://www.mcs.csueastbay.edu/~malek/TeX/Triangle.pdf
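
As a minimal sketch of why triangular systems are cheap to solve (the helper forward_substitution below is only for illustration, not part of the original notebook):

In [ ]:
# Sketch: solving L x = b for a lower triangular L by forward substitution
def forward_substitution(L, b):
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n):
        # x[:i] are already known, so each step needs one dot product and one division
        x[i] = (b[i] - L[i, :i].dot(x[:i])) / L[i, i]
    return x

L_demo = np.tril(np.random.rand(4, 4)) + np.eye(4)   # well-conditioned lower triangular matrix
b_demo = np.random.rand(4)
np.allclose(forward_substitution(L_demo, b_demo), np.linalg.solve(L_demo, b_demo))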

So we use this to compute $K_y^{-1}$:

$$ K_{y}^{-1} = \left(L L^T \right)^{-1} = \left(L^T\right)^{-1} L^{-1} = L^{-T} L^{-1} $$

Note: $\left( A B\right)^{-1} = B^{-1} A^{-1}$

$$ \vec \mu_* = \vec \mu (X_*) + K_*^T K_y^{-1} ( \vec y - \vec \mu (X) ) $$ with prior mean $\mu (\cdot) = 0$: $$ \vec \mu_* = K_*^T K_y^{-1} \vec y = K_*^T L^{-T} L^{-1} \vec y = \left( L^{-1} K_* \right)^T \left( L^{-1} \vec y \right) $$

For computing $L^{-1} K_*$ and $L^{-1} \vec y$ we can exploit the fact that $L$ is a lower triangular matrix (forward substitution).

In SciPy this is available as scipy.linalg.solve_triangular.

In [16]:
import scipy.linalg
help(scipy.linalg.solve_triangular)
Help on function solve_triangular in module scipy.linalg.basic:

solve_triangular(a, b, trans=0, lower=False, unit_diagonal=False, overwrite_b=False, debug=None, check_finite=True)
    Solve the equation `a x = b` for `x`, assuming a is a triangular matrix.
    
    Parameters
    ----------
    a : (M, M) array_like
        A triangular matrix
    b : (M,) or (M, N) array_like
        Right-hand side matrix in `a x = b`
    lower : bool, optional
        Use only data contained in the lower triangle of `a`.
        Default is to use upper triangle.
    trans : {0, 1, 2, 'N', 'T', 'C'}, optional
        Type of system to solve:
    
        ========  =========
        trans     system
        ========  =========
        0 or 'N'  a x  = b
        1 or 'T'  a^T x = b
        2 or 'C'  a^H x = b
        ========  =========
    unit_diagonal : bool, optional
        If True, diagonal elements of `a` are assumed to be 1 and
        will not be referenced.
    overwrite_b : bool, optional
        Allow overwriting data in `b` (may enhance performance)
    check_finite : bool, optional
        Whether to check that the input matrices contain only finite numbers.
        Disabling may give a performance gain, but may result in problems
        (crashes, non-termination) if the inputs do contain infinities or NaNs.
    
    Returns
    -------
    x : (M,) or (M, N) ndarray
        Solution to the system `a x = b`.  Shape of return matches `b`.
    
    Raises
    ------
    LinAlgError
        If `a` is singular
    
    Notes
    -----
    .. versionadded:: 0.9.0

In [17]:
k_ = kernel(X, Xtest, kernelParameter = kernelParameter)
In [18]:
Lk = scipy.linalg.solve_triangular(L, k_, lower=True)
Ly = scipy.linalg.solve_triangular(L, y, lower=True)
mu = np.dot(Lk.T, Ly)

Analogously: $$ \Sigma_{*} = K_{**} - K_*^T K_y^{-1} K_* = K_{**} - K_*^T L^{-T}L^{-1} K_* = K_{**} - \left(L^{-1} K_*\right)^T \left(L^{-1} K_* \right) $$

In [21]:
K_ = kernel(Xtest, Xtest, kernelParameter = kernelParameter)

# we need only the diagonal elements of $\Sigma_*$:
s2 = np.diag(K_) - np.sum(Lk**2, axis=0)
s_ = np.sqrt(s2)
In [23]:
# PLOTS:
pl.figure(1)
pl.clf()
pl.plot(X, y, 'r+', ms=20)
pl.plot(Xtest, f(Xtest), 'b-')
pl.gca().fill_between(Xtest.flat, mu-3*s_, mu+3*s_, color="#dddddd")
pl.plot(Xtest, mu, 'r--', lw=2)
pl.savefig('predictive.png', bbox_inches='tight')
pl.title('Mean predictions plus 3 st.deviations')
pl.axis([-5, 5, -3, 3])
Out[23]:
[-5, 5, -3, 3]
In [30]:
# To draw samples from a multivariate Gaussian N(mu, Sigma): compute the Cholesky
# factor L of Sigma and transform standard normal samples z via mu + L z.
# A small jitter s * I is added to the diagonal for numerical stability.

# draw samples from the prior at our test points.
L = np.linalg.cholesky(K_  + s * np.eye(n) )
f_prior = np.dot(L, np.random.normal(size=(n,10)))
pl.figure(2)
pl.clf()
pl.plot(Xtest, f_prior)
pl.title('Ten samples from the GP prior')
pl.axis([-5, 5, -3, 3])
pl.savefig('prior.png', bbox_inches='tight')

# draw samples from the posterior at our test points.
L = np.linalg.cholesky(K_ + s * np.eye(n) - np.dot(Lk.T, Lk))
f_post = mu.reshape(-1,1) + np.dot(L, np.random.normal(size=(n,10)))
pl.figure(3)
pl.clf()
pl.plot(Xtest, f_post)
pl.title('Ten samples from the GP posterior')
pl.axis([-5, 5, -3, 3])
pl.savefig('post.png', bbox_inches='tight')

TODO: How to determine the hyperparameters (s_y, kernelParameter, etc.)? One option is to maximize the marginal likelihood, as sketched below.
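
One common approach (sketched here under the assumption that kernel(), X and y from above are available; this cell is not part of the original notebook) is to maximize the log marginal likelihood $$ \log p(\vec y \mid X) = -\tfrac{1}{2} \vec y^T K_y^{-1} \vec y - \tfrac{1}{2} \log |K_y| - \tfrac{m}{2} \log 2\pi $$ (see Rasmussen & Williams), e.g. by a simple grid search over the hyperparameters:

In [ ]:
# Sketch: choose hyperparameters by maximizing the log marginal likelihood
#   log p(y | X) = -1/2 y^T K_y^{-1} y - 1/2 log|K_y| - m/2 log(2 pi)
# computed via the Cholesky factor of K_y (reuses kernel(), X and y from above).
def log_marginal_likelihood(X, y, kernelParameter, s_y):
    m = len(X)
    K_y = kernel(X, X, kernelParameter=kernelParameter) + s_y * np.eye(m)
    L = np.linalg.cholesky(K_y)
    alpha = scipy.linalg.solve_triangular(L, y, lower=True)   # alpha = L^{-1} y
    return (-0.5 * alpha.dot(alpha)                           # -1/2 y^T K_y^{-1} y
            - np.sum(np.log(np.diag(L)))                      # -1/2 log|K_y| = -sum_i log L_ii
            - 0.5 * m * np.log(2 * np.pi))

candidates = [(kp, sy) for kp in (0.1, 0.5, 1., 2., 5.) for sy in (1e-4, 1e-3, 1e-2)]
best = max(candidates, key=lambda p: log_marginal_likelihood(X, y, *p))
best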

Python Library:

Literature:

Gaussian Processes

  • Lectures by Nando de Freitas: Machine Learning: Gaussian processes for nonlinear regression, http://www.cs.ubc.ca/~nando/540-2013/lectures.html
  • Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.
  • Rasmussen, C. E., & Williams, C. K. (2006). Gaussian processes for machine learning (Vol. 1). Cambridge: MIT press.

Inversion of a positive semi-definite matrix by Cholesky decomposition