I’m going to change the notation I’ve been using so far to match that used in this paper, which would have been much better if it had been called One Weird Centering Trick. Apologies for the change, but the new notation is much better.
Call the state of the visible units $v$. In Cid’s case, there is only one visible unit, so this vector has length one. Call the states of the variables in the first hidden layer $h^1$. There are four of these, so this vector has length four. Call the states of the variables in the second hidden layer $h^2$. There are also four of these. Recall that the states of all of these units are binary, so these vectors are vectors of bits.
If there are $n_v$ visible units, $n_1$ units in the first hidden layer, and $n_2$ units in the second hidden layer, there are a total of $2^{n_v + n_1 + n_2}$ possible states of the network (so in the case above that would be $2^{1+4+4} = 512$ possible configurations).
Call the biases on the visible units $b$, the biases on the first hidden layer $b^1$, and the biases on the second hidden layer $b^2$. The connection weights between the visible and first hidden layers will be $W^1$, and between the first and second hidden layers $W^2$.
$b$, $b^1$, $b^2$, $W^1$, and $W^2$ are called model parameters. All of these are built out of real numbers.
With this new and better notation, our specific DBM looks like

$$E(v, h^1, h^2; \theta) = -b^\top v - (b^1)^\top h^1 - (b^2)^\top h^2 - v^\top W^1 h^1 - (h^1)^\top W^2 h^2$$

where $\theta = \{b, b^1, b^2, W^1, W^2\}$ groups all the model parameters. The way to understand this is that given a particular setting of all the real numbers in $\theta$, any setting of the binary values for all of $v$, $h^1$, and $h^2$ gives a real number (the energy $E$). The lower the energy of a state, the more probable it is. If the system is thermally equilibrated at temperature $T$, the probabilities are

$$P(v, h^1, h^2) = \frac{e^{-E(v, h^1, h^2; \theta)/T}}{Z}$$
where $Z = \sum_{v, h^1, h^2} e^{-E(v, h^1, h^2; \theta)/T}$ is the partition function (from here on we’ll set $T = 1$). For a simple system like Cid’s brain, the partition function can be computed directly, as there are only $512$ possible states of the network. However, the direct computation of the partition function is usually not possible for the sizes of networks we’ll eventually want to work with: even a relatively small network of, say, a thousand units has $2^{1000} \approx 10^{301}$ possible configurations.
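Because Cid’s brain has only nine units, both the energy and the partition function can be computed exactly by brute force. Here’s a minimal NumPy sketch (the parameter values and all names are mine, chosen purely for illustration, not the trained values from the post):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_v, n_h1, n_h2 = 1, 4, 4  # Cid's brain: 1 visible unit, two hidden layers of 4

# Model parameters theta: biases b, b1, b2 and weights W1, W2 (random here)
b = rng.normal(scale=0.1, size=n_v)
b1 = rng.normal(scale=0.1, size=n_h1)
b2 = rng.normal(scale=0.1, size=n_h2)
W1 = rng.normal(scale=0.1, size=(n_v, n_h1))   # visible <-> first hidden
W2 = rng.normal(scale=0.1, size=(n_h1, n_h2))  # first <-> second hidden

def energy(v, h1, h2):
    # E(v, h1, h2; theta): lower energy means a more probable state
    return -(b @ v + b1 @ h1 + b2 @ h2 + v @ W1 @ h1 + h1 @ W2 @ h2)

def partition_function(T=1.0):
    # Brute-force Z: sum exp(-E/T) over all 2^(1+4+4) = 512 binary states
    Z = 0.0
    for bits in product((0.0, 1.0), repeat=n_v + n_h1 + n_h2):
        s = np.array(bits)
        Z += np.exp(-energy(s[:n_v], s[n_v:n_v + n_h1], s[n_v + n_h1:]) / T)
    return Z
```

With $Z$ in hand, the probability of any single state is just `exp(-energy(...)) / Z`, which is exactly the kind of direct computation that becomes hopeless once the state count explodes.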
Some Technology We’ll Use: An Alternating Gibbs Sampler
It turns out that all we need to do to train our DBM is to draw samples from two different distributions. The first is the ‘freely running’ distribution of the network above, where we set the model parameters and draw samples from the full network, and we assume that the probability distribution is Boltzmann (i.e. equilibrated at some temperature). The second is the distribution where the visible units have been clamped to an input data object, where again we assume that this ‘clamped’ distribution is in thermal equilibrium at the same temperature. If we can do these two things, then there is a simple prescription that uses both types of samples to adjust the model parameters so that the KL-divergence between the probability distribution over the incoming data (the signals from the EU) and the probability distribution generated by the brain (the freely running distribution) is minimized.
But there is a problem with this. Reaching thermal equilibrium in a big network is very computationally expensive. Given a set of model parameters, there is no quick way to generate the right kind of distributions.
But we can approximate these distributions under certain circumstances. Recall that the layered structure of our network means that odd and even layers are conditionally independent — in other words, given a setting of all the variables in one of these two, we can compute what the probabilities are of the other set. To see why this is, take a look at the picture to the right, where you can see how this works for a 5-layer DBM (the important part to note is that the groups within the dashed boxes have no internode connections, and if all the variables in one box are clamped, the probabilities of each node in the other box can be computed exactly conditioned on those variables). We can then alternate between clamping one side and computing the other. This alternation can be done using something called Gibbs sampling. You might expect that this procedure will usually generate a distribution that is close to the one we actually want, although it’s not clear that this will always be the case.
The equations for how to do alternating Gibbs sampling for our DBM are pretty simple. Define $x \sim \mathcal{B}(p)$ to be such that the variable $x$ is drawn randomly from a Bernoulli distribution with parameter $p$. In other words, the value of $x$ will be 1 with probability $p$, and 0 otherwise. Define the function $\mathrm{sigm}(x) = \frac{1}{1 + e^{-x}}$. Then if we assume all the variables on one side of our bipartite set-up are fixed, each variable on the other side will have value 1 with probability sigm(bias on the variable in question + the sum of the values of the variables it’s connected to times the connection weights between them). You can see how this is derived by looking in Section 2 of the Salakhutdinov and Hinton paper.
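In code, these two pieces of machinery are just a couple of lines each. A sketch (the function names are mine; `bernoulli` works elementwise on a vector of probabilities):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(x):
    # Logistic sigmoid: sigm(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli(p):
    # Draw x ~ B(p): each entry is 1 with probability p, and 0 otherwise
    return (rng.random(np.shape(p)) < p).astype(float)
```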
Specifically, for our three-layer DBM, the equations for doing alternating Gibbs sampling are as follows. First, let’s do the case where the visible unit layer is left unclamped. We can do the sampling by first fixing the values of the first hidden layer $h^1$ to random values, then doing the following two things in parallel: (a) computing the probabilities of the visible units given $h^1$, then drawing samples and, based on these, assigning specific values to the visible units; and (b) computing the probabilities of the second hidden layer units, then drawing samples and assigning values to these. These can be parallelized as they are conditionally independent given $h^1$. After values for both $v$ and $h^2$ have been obtained, probabilities for the first hidden layer can then be computed conditioned on these. We then draw samples, assign values to $h^1$, and then repeat this whole process a bunch of times. The equations for these are

$$P(v_j = 1 \mid h^1) = \mathrm{sigm}\Big(b_j + \sum_k W^1_{jk} h^1_k\Big)$$

$$P(h^2_m = 1 \mid h^1) = \mathrm{sigm}\Big(b^2_m + \sum_k W^2_{km} h^1_k\Big)$$

$$P(h^1_k = 1 \mid v, h^2) = \mathrm{sigm}\Big(b^1_k + \sum_j W^1_{jk} v_j + \sum_m W^2_{km} h^2_m\Big)$$
For the second case we need, the procedure is similar, except that the visible layer is clamped to the value of a data object $d$ throughout the procedure. This changes the equations for the other two layers to

$$P(h^1_k = 1 \mid d, h^2) = \mathrm{sigm}\Big(b^1_k + \sum_j W^1_{jk} d_j + \sum_m W^2_{km} h^2_m\Big)$$

$$P(h^2_m = 1 \mid h^1) = \mathrm{sigm}\Big(b^2_m + \sum_k W^2_{km} h^1_k\Big)$$
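Putting the two sampling procedures together, here is a NumPy sketch of both phases for Cid’s 1-4-4 network. The parameter values and all names are illustrative (random, not the trained values from the post), and I’ve written the conditionally independent updates sequentially for clarity:

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h1, n_h2 = 1, 4, 4

# Illustrative model parameters theta (random, not trained)
b = np.zeros(n_v)
b1 = np.zeros(n_h1)
b2 = np.zeros(n_h2)
W1 = rng.normal(scale=0.1, size=(n_v, n_h1))
W2 = rng.normal(scale=0.1, size=(n_h1, n_h2))

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    # Bernoulli draw: 1 with probability p, elementwise
    return (rng.random(np.shape(p)) < p).astype(float)

def gibbs_free(n_steps=100):
    # Freely running phase: nothing clamped
    h1 = sample(0.5 * np.ones(n_h1))             # random start for middle layer
    for _ in range(n_steps):
        v = sample(sigm(b + W1 @ h1))            # (a) visible layer given h1
        h2 = sample(sigm(b2 + h1 @ W2))          # (b) top layer given h1
        h1 = sample(sigm(b1 + v @ W1 + W2 @ h2)) # middle layer given v and h2
    return v, h1, h2

def gibbs_clamped(d, n_steps=100):
    # Clamped phase: visible layer held at the data vector d throughout
    h1 = sample(0.5 * np.ones(n_h1))
    for _ in range(n_steps):
        h2 = sample(sigm(b2 + h1 @ W2))
        h1 = sample(sigm(b1 + d @ W1 + W2 @ h2))
    return h1, h2
```

Note that the (a) and (b) updates in `gibbs_free` are independent given `h1`, so in a real implementation they could run in parallel exactly as described above.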
Note that even though we’re looking specifically at a three-layer DBM, the same ideas hold for DBMs of arbitrary depth, as the even and odd layers are always conditionally independent as long as we enforce the layer-wise structure.
Another thing I should mention is that the freely running computation can follow several ‘trajectories’ in parallel, corresponding to different samples being assigned to the nodes of the network while sharing the same conditional probabilities. In the literature, each of these trajectories is called a ‘fantasy particle’.
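As a sketch of that idea (all names and sizes here are mine, chosen for illustration): each fantasy particle is an independent Gibbs chain, and because every chain uses the same conditional-probability formula, one layer update for all chains is a single matrix operation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_h1, n_h2, n_fantasy = 4, 4, 8   # 8 fantasy particles; layer sizes as in Cid's brain

b2 = np.zeros(n_h2)
W2 = rng.normal(scale=0.1, size=(n_h1, n_h2))

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

# Each ROW is one fantasy particle: an independent freely running Gibbs chain.
H1 = (rng.random((n_fantasy, n_h1)) < 0.5).astype(float)

# One update computes the conditional probabilities for ALL chains at once
# (a single matrix product), but each chain draws its own samples.
P_H2 = sigm(H1 @ W2 + b2)                       # shape (n_fantasy, n_h2)
H2 = (rng.random(P_H2.shape) < P_H2).astype(float)
```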
OK here goes!
I have never used matplotlib before. You may be able to tell from watching this video I created using it. Here the units in the network are black if they are 1, and white if they are 0. They are surrounded by a halo with a color indicating the value of the bias at that unit, and are connected by dashed lines whose color indicates the value of the connection weight between the two units. The text at the top shows the cumulative probability of the network generating a ‘1’ at the visible unit over the training as the training progresses. The (single) visible unit is at the bottom, followed by the first hidden layer and then the second hidden layer on the top.
What’s shown here are 30 frames per second for one minute. Each frame consists of a step where a specific piece of training data comes in, and the weights and biases are adjusted based on that piece of data according to the rules we set out for training DBMs in the previous posts. This mimics a ‘single pixel eye with one bit resolution’ capable of seeing 30 frames per second, with a brain behind it that is learning over this data for one minute.
Recall that we arbitrarily set the ‘correct’ probability of seeing 1 to 0.135. If Cid’s internal model can match this probability he will have become Enlightened in the context of the External Universe we’ve created for him.
That was cool!! Go Cid.
Recall that we’d defined a measure of intelligence based on the inverse of the KL-divergence. Here’s a plot of Cid’s intelligence as a function of the number of training samples we’ve shown him, for a few different training runs. Training Boltzmann Machines makes Christmas fun!!!
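Since Cid’s visible layer is a single bit, the KL-divergence here reduces to the divergence between two Bernoulli distributions: the EU’s ($p = 0.135$) and the one Cid’s freely running brain generates. A quick sketch of that computation (function names are mine):

```python
import math

def bernoulli_kl(p, q):
    # KL(p || q) for two Bernoulli distributions over {0, 1}
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p_eu = 0.135  # the 'correct' probability of the EU generating a 1

# As Cid's model probability q approaches p_eu, the KL-divergence goes to
# zero, so its inverse (the intelligence measure) grows without bound.
```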