Reaching Enlightenment II.

I’m going to change the notation I’ve been using so far to match that used in this paper, which would have been much better if it had been called One Weird Centering Trick. Apologies for the change, but the new notation is much better.

Call the state of the visible units $\vec{x}$. In Cid’s case, there is only one visible unit, so this vector has length one. Call the states of the variables in the first hidden layer $\vec{y}$. There are four of these, so this vector has length four. Call the states of the variables in the second hidden layer $\vec{z}$. There are also four of these. Recall that the states of all of these units are binary, so these vectors are vectors of bits.

If there are $M_x$ visible units, $M_y$ units in the first hidden layer, and $M_z$ units in the second hidden layer, there are a total of $2^{M_x + M_y + M_z}$ possible states of the network (so in the case above that would be $2^{1 + 4 + 4} = 512$ possible configurations).

Call the biases on the visible units $\vec{a}$, the biases on the first hidden layer $\vec{b}$, and the biases on the second hidden layer $\vec{c}$. The connection weights between the visible and first hidden layers will be $\hat{W}$ and between the first and second hidden layers $\hat{V}$.

$\vec{a}, \vec{b}, \vec{c}, \hat{W}, \hat{V}$ are called model parameters. All of these are built out of real numbers.

With this new and better notation, our specific DBM looks like

and the energy of its possible configurations is

$E(\vec{x}, \vec{y}, \vec{z}; \theta) = - \vec{y}^T \hat{W} \vec{x} - \vec{z}^T \hat{V} \vec{y} - \vec{x}^T \vec{a} -\vec{y}^T \vec{b} - \vec{z}^T \vec{c}$

or equivalently

$E(\vec{x}, \vec{y}, \vec{z}; \theta) = - \sum_{i=0}^{M_x} \sum_{j=0}^{M_y} y_j W_{ij} x_i -\sum_{j=0}^{M_y} \sum_{k=0}^{M_z} z_{k} V_{jk} y_{j}$

$- \sum_{i=0}^{M_x} a_i x_i - \sum_{j=0}^{M_y} b_j y_j - \sum_{k=0}^{M_z} c_k z_k$

where $\theta = \{ \hat{W}, \hat{V}, \vec{a}, \vec{b}, \vec{c} \}$ groups all the model parameters. The way to understand this is that given a particular setting of all the real numbers in $\theta$, any setting of the binary values for all of $\vec{x}, \vec{y}, \vec{z}$ gives a real number (the energy $E$). The lower the energy of a state, the more probable it is. If the system is thermally equilibrated at temperature $T$, the probabilities are

$P(\vec{x}, \vec{y}, \vec{z}; \theta) = {{1}\over{{\cal Z}}} \exp(-E(\vec{x}, \vec{y}, \vec{z}; \theta)/T)$

where ${\cal Z}$ is the partition function. For a simple system like Cid’s brain, the partition function can be computed directly, as there are only $2^9$ possible states of the network. However the direct computation of the partition function is usually not possible for the sizes of networks we’ll eventually want to work with — even a relatively small network of say a thousand units has $2^{1000}$ possible configurations.

Some Technology We’ll Use: An Alternating Gibbs Sampler

It turns out that all we need to do to train our DBM is to draw samples from two different distributions. The first is the ‘freely running’ distribution of the network above, where we set the model parameters and draw samples from the full network, and we assume that the probability distribution is Boltzmann (i.e. equilibrated at some temperature). The second is the distribution where the visible units have been clamped to an input data object, where again we assume that this ‘clamped’ distribution is in thermal equilibrium at the same temperature. If we can do these two things, then there is a simple prescription for alternating between these two types of sampling, where the KL-divergence between the probability distribution over the incoming data (the signals from the EU) and the probability distribution generated by the brain (the freely running distribution) is minimized.

But there is a problem with this. Reaching thermal equilibrium in a big network is very computationally expensive. Given a set of model parameters, there is no quick way to generate the right kind of distributions.

In a DBM, the layered structure allows us to segment the variables into two groups, each of which has no internal connections.

But we can approximate these distributions under certain circumstances. Recall that the layered structure of our network means that odd and even layers are conditionally independent — in other words, given a setting of all the variables in one of these two, we can compute what the probabilities are of the other set. To see why this is, take a look at the picture to the right, where you can see how this works for a 5-layer DBM (the important part to note is that the groups within the dashed boxes have no internode connections, and if all the variables in one box are clamped, the probabilities of each node in the other box can be computed exactly conditioned on those variables). We can then alternate between clamping one side and computing the other. This alternation can be done using something called Gibbs sampling. You might expect that this procedure will usually generate a distribution that is close to the one we actually want, although it’s not clear that this will always be the case.

The equations for how to do alternating Gibbs sampling for our DBM are pretty simple. Define $x \sim B(p)$ to be such that the variable $x$ is drawn randomly from a Bernoulli distribution of parameter $p$. In other words, the value of $x$ will be 1 with probability $p$, and 0 otherwise. Define the function $sigm(x) = 1/(1+\exp(-x))$. Then if we assume all the variables on one side of our bipartite set-up are fixed, each variable on the other side will have value 1 with probability sigm(bias on the variable in question + the sum of the values of the variables it’s connected to times the connection weights between them). You can see how this is derived by looking in Section 2 of the Salakhudinov and Hinton paper.

Specifically for our three layer DBM, the equations for doing alternating Gibbs sampling are as follows. First, let’s do the case where the visible unit layer $\vec{x}$ is left unclamped. We can do the sampling by first fixing the values of the first hidden layer $\vec{y}$ to random values, then in parallel do the following: (a) compute the probabilities of the visible units given these, then drawing samples and based on these assigning specific values to the visible units; and (b) compute the probabilities of the second hidden layer units, and then drawing samples and assigning values to these. These can be parallelized as they are independent. After values for both $\vec{x}$ and $\vec{z}$ have been obtained, probabilities for the first hidden layer $\vec{y}$ can then be computed conditioned on these. We then draw samples, assign values to $\vec{y}$, and then repeat this whole process a bunch of times. The equations for these are

$\vec{x} \sim B(sigm(\hat{W}^T \vec{y} + \vec{a}))$

$\vec{z} \sim B(sigm(\hat{V} \vec{y} + \vec{c}))$

$\vec{y} \sim B(sigm(\hat{W}\vec{x}+\hat{V}^{T}\vec{z}+\vec{b}))$

For the second case we need, the procedure is similar except the visible layer is clamped to the value of a data object during the procedure. This changes the equations for the other two layers to

$\vec{z} \sim B(sigm(\hat{V} \vec{y} + \vec{c}))$

$\vec{y} \sim B(sigm(\hat{W}\vec{x}_{data}+\hat{V}^{T}\vec{z}+\vec{b}))$

Note that even through we’re looking specifically at a three-layer DBM, the same ideas hold for arbitrary depth DBMs as the even and odd layers are always conditionally independent as long as we enforce the layer-wise structure.

Another thing I should mention is the computation of the freely running part can compute several ‘trajectories’ in parallel, corresponding to different samples being assigned to the nodes of the network but having the same probabilities. In the literature, each of these trajectories is called a ‘fantasy particle’.

OK here goes!

I have never used matplotlib before. You may be able to tell from watching this video I created using it. Here the units in the network are black if they are 1, and white if they are 0. They are surrounded by a halo with a color indicating the value of the bias at that unit, and are connected by dashed lines whose color indicates the value of the connection weight between the two units. The text at the top shows the cumulative probability of the network generating a ‘1’ at the visible unit over the training as the training progresses. The (single) visible unit is at the bottom, followed by the first hidden layer and then the second hidden layer on the top.

What’s shown here are 30 frames per second for one minute. Each frame consists of a step where a specific piece of training data comes in, and the weights and biases are adjusted based on that piece of data according to the rules we set out for training DBMs in the previous posts. This mimics a ‘single pixel eye with one bit resolution’ capable of seeing 30 frames per second, with a brain behind it that is learning over this data for one minute.

Recall that we arbitrarily set the ‘correct’ probability of seeing 1 to 0.135. If Cid’s internal model can match this probability he will have become Enlightened in the context of the External Universe we’ve created for him.

That was cool!! Go Cid.

Recall that we’d defined a measure of intelligence, based on the inverse of the KL-divergence. Here’s a plot of Cid’s intelligence as a function of number of training samples we’ve shown him for a few different training runs. Training Boltzmann Machines makes Christmas fun!!!

9 thoughts on “Reaching Enlightenment II.”

• Great, glad to hear it!

1. Hi Geordie,
Don’t you have a better way of training this? Can’t you maximize the boltzmann distribution directly with a Dwave? I thought that was where this was all leading…

• That’s where it is going, but first things first. Too many people use code without understanding what it’s doing. I like to really understand what people are doing.

2. hi jumping into this late & am gonna look at your earlier blogs on this but
a) its not clear what the training data is here
b) to me a big part of real intelligence is **motor function** and am maybe developing a theory that one cannot have much intelligence without interaction with the environment. in fact am thinking lately that intelligence is actually a sort of homeostatic regulation system of the entity with its environment…. also, have a big recent blog/survey on deep learning, will post the link direct when you blog on this next & catch it current….

3. This is definitely the best series on the BM topic I ‘ve ever seen so far.

• Thanks for the feedback!! Doing this helped me understand how they work — it was fun!