Montavon’s code — the DBM class

Last time we reviewed most of Gregoire Montavon’s code for training DBMs, but saved the most important part for the end. Let’s take a look at that now. This is the heart of the code.

The DBM class and its member variables

Here is the front matter and the initialization of the member variables for the class.

# ====================================================
# Centered deep Boltzmann machine
# ----------------------------------------------------
# - self.W: list of weight matrices between layers
# - self.B: list of bias associated to each unit
# - self.O: list of offsets associated to each unit
# - self.X: free particles tracking model statistics
# ====================================================

class DBM:

    # --------------------------------------------
    # Initialize model parameters and particles
    # --------------------------------------------
    def __init__(self, M, B):
        self.W = [numpy.zeros([m, n]).astype('float32') for m, n in zip(M[:-1], M[1:])]
        self.B = [numpy.zeros([m]).astype('float32') + b for m, b in zip(M, B)]
        self.O = [sigm(b) for b in self.B]
        self.X = [numpy.zeros([mb, m]).astype('float32') + o for m, o in zip(M, self.O)]

Last time we saw that the inputs to the class were a list of the number of variables in each layer (that’s M above — in the case we looked at, M was the list [784, 400, 100]), and a list of biases on the variables (that’s B above). The list of biases has three entries. The first is an array containing all individual biases on the visible units. The second is a single number which is the common bias on all hidden layer 1 units. The third is a single number which is the common bias on all hidden layer 2 units. So B is a list of length 3 in the case we looked at last.

Now let’s take a look at how the member variables are initialized. First some pythonic notation. M[:-1] means the list M without its last entry. So for the MNIST example M[:-1] = [784, 400]. M[1:] means the list without the zeroth entry — that would be M[1:] = [400, 100]. The function zip(A, B) acts like a zipper — it returns a list of tuples, where each tuple contains the ith element of A and the ith element of B. For example, zip([1,2,3], [4,5,6]) = [(1,4), (2,5), (3,6)]. So zip(M[:-1], M[1:]) returns a list of tuples, each of which contains the ith and (i+1)th number of variables in the deep net. For example, if we had a deep net containing 5 layers, with M = [784, 400, 100, 50, 10], then M[:-1] = [784, 400, 100, 50] and M[1:] = [400, 100, 50, 10], and zip(M[:-1], M[1:]) = [ (784, 400), (400, 100), (100, 50), (50, 10)]. Isn’t that clever? I love stuff like this.
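If you want to convince yourself, here is the slicing-and-zip idiom on its own. Note that Montavon’s script appears to be Python 2, where zip returns a list directly; the list() wrapper below makes the same check work under Python 3 too.

M = [784, 400, 100, 50, 10]
print(M[:-1])                    # [784, 400, 100, 50]
print(M[1:])                     # [400, 100, 50, 10]
print(list(zip(M[:-1], M[1:])))  # [(784, 400), (400, 100), (100, 50), (50, 10)]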

Alright so now we can see what the first member variable looks like. self.W = [numpy.zeros([m, n]).astype('float32') for m, n in zip(M[:-1], M[1:])] creates a list of arrays. Each array is initially filled with zeros and will hold the connection weights we are trying to learn. There is one array for each pair of connected layers in the network. Each has size [m, n] where m is the number of variables in the lower level and n is the number in the upper level. In the case where M = [784, 400, 100] this means that self.W is a list with two entries, the first being an array of size [784, 400] and the second being an array of size [400, 100]. The data type of the weights is set to float32.

Next let’s take a look at the bias term. Here we have self.B = [numpy.zeros([m]).astype('float32') + b for m, b in zip(M, B)]. The zip command creates a list of tuples. The first of these is (784, array_containing_biases_of_length_784), the second is (400, -1) and the third is (100, -1). The initialization command therefore creates a list with 3 entries. The first is a length-784 array initialized with the biases on the visible units; the second is a length-400 array with each element equal to -1; the third is a length-100 array with each element equal to -1. This is also clever, as the same command works on both array-style and scalar entries in B. If you wanted to create individualized initial biases on the hidden layers, the exact same command should work; instead of a single number (like -1) you would feed it an array of the correct length.
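The reason the same expression handles both cases is numpy broadcasting; here is a tiny illustration with made-up sizes:

import numpy

# A single scalar bias is broadcast to every unit in the layer...
print(numpy.zeros([4]).astype('float32') + (-1))
# [-1. -1. -1. -1.]

# ...while an array of per-unit biases is added elementwise.
print(numpy.zeros([4]).astype('float32') + numpy.array([0.5, -1.0, 0.0, 2.0]))
# [ 0.5 -1.   0.   2. ]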

OK so after this is done, self.B is a list with 3 entries. The zeroth entry is an array containing the biases on the visible units. The first entry is an array containing the biases on the first hidden layer units. The second entry is an array containing the biases on the second hidden layer units.

Next we initialize the member variable that will hold the offsets. These offsets are used to ‘center’ the DBM, which is useful for making the optimization better behaved. This one is pretty simple. self.O = [sigm(b) for b in self.B] creates a list containing three elements: the sigm function applied elementwise to self.B[0], to self.B[1], and to self.B[2] respectively. The reason the sigm() function is used is that the offsets are initially chosen to be the mean value of each variable, given the bias on that variable and assuming all the weights connected to it are zero. The sigm() function computes exactly that mean for us (assuming Boltzmann statistics).
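The sigm function itself was defined in the part of the script we went through last time. Assuming it is the usual logistic function, a minimal sketch would be:

import numpy

def sigm(x):
    # Logistic function: the mean value of a binary unit with total input x,
    # assuming Boltzmann statistics.
    return 1.0 / (1.0 + numpy.exp(-x))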

Lastly we initialize the self.X member variable. The zip(M, self.O) command creates a list of three tuples of the form [(784, array_containing_offsets_on_visibles), (400, array_containing_offsets_on_first_hidden_layer), (100, array_containing_offsets_on_second_hidden_layer)].

When self.X = [numpy.zeros([mb, m]).astype('float32') + o for m, o in zip(M, self.O)] is called, it creates a list with three entries. These have sizes [mb, 784], [mb, 400] and [mb, 100], where recall mb is a global variable specifying the minibatch size (the number of data examples used in each cycle of learning). In each of these, each row is initialized to the offset numbers in the corresponding entry of self.O; that is, each of the mb rows in each of the 3 arrays in self.X is initially identical.
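As a quick shape check, here is what one of these particle arrays looks like for the first hidden layer (the value of mb and the offsets are made up for illustration):

import numpy

mb = 25                                           # hypothetical minibatch size
o  = numpy.zeros([400]).astype('float32') + 0.3   # stand-in offsets for the 400 hidden units
X1 = numpy.zeros([mb, 400]).astype('float32') + o

print(X1.shape)         # (25, 400)
print((X1 == o).all())  # True: every row is a copy of the offset vector o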

Gibbs sampling

The first method defined in the class is .gibbs(X, l). Let’s check it out.

# --------------------------------------------
# Gibbs activation of a layer
# --------------------------------------------
def gibbs(self, X, l):
     bu = numpy.dot(X[l-1]-self.O[l-1], self.W[l-1]) if l > 0 else 0
     td = numpy.dot(X[l+1]-self.O[l+1], self.W[l].T) if l+1 < len(X) else 0
     X[l] = realize(sigm(bu + td + self.B[l]))

The method takes two input arguments. The second of these is the layer of the network (that’s l, which is an integer), where l=0 is the visible layer, l=1 is the first hidden layer, l=2 is the second hidden layer, etc.

Inside the method you can see two quantities being defined — these stand for ‘bottom up’ and ‘top down’. These are conditionally defined. If you take a look at the first one (bu), you can see that bu=0 if l=0, otherwise it’s the specified dot product. This means that if l=0 (the visible layer), there is no layer below it so the ‘bottom up’ contribution is zero. Similarly for the top down quantity, if we are in the topmost layer of the network, there is no layer above that, so td is set to zero, otherwise it’s the dot product.

The first input argument to the method (X) is a list of arrays. We can see from the structure of the method that this list should have the same number of elements as self.O — that is, there should be the same number of elements as the number of layers in the network. There is a very clever trick that’s being played here about what the actual size of the arrays inside the list X are. Note that the command X[l-1]-self.O[l-1] is sensible as long as the number of columns in X[l-1] matches the length of self.O[l-1] — X[l-1] can have an arbitrary number of rows and this will still work. If there is more than one row in X[l-1] then each row gets self.O[l-1] subtracted off. We’ll see how this flexibility is used to good effect shortly!
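To see that this really does work for any number of rows, here is the subtraction on its own, with made-up shapes:

import numpy

X_lower = numpy.random.rand(25, 784)   # 25 rows of states for the layer below
O_lower = numpy.random.rand(784)       # one offset per unit in that layer
print((X_lower - O_lower).shape)       # (25, 784): the offsets are subtracted from every row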

Alright so let’s look at the dot product defining bu. We have bu = numpy.dot(X[l-1]-self.O[l-1], self.W[l-1]) if l > 0 else 0. So let’s assume l>0. What’s the dot product doing?

The size of the first term, X[l-1]-self.O[l-1], is the same as the size of X[l-1]. The weight array self.W[l-1] is the array of weights immediately below layer l. If we imagine that each row of X[l-1] holds the states of the variables in layer l-1, and self.O[l-1] holds the offsets on those, then (X[l-1]-O[l-1])·W[l-1] gives the effect of the layer l-1 variables as felt by the variables on layer l: the ‘bottom up’ effect.

Similarly the ‘top down’ quantity does the exact same thing, but propagates the influence of the layer above down to layer l.
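For the MNIST-sized network, at the middle layer (l=1) the shapes line up as follows. Everything below is a random stand-in rather than the real model state; the point is just the bookkeeping:

import numpy

mb = 25                                                          # hypothetical minibatch size
X  = [numpy.random.rand(mb, n) for n in (784, 400, 100)]         # states, one array per layer
O  = [numpy.random.rand(n) for n in (784, 400, 100)]             # offsets, one vector per layer
W  = [numpy.random.rand(784, 400), numpy.random.rand(400, 100)]  # weights between adjacent layers

bu = numpy.dot(X[0] - O[0], W[0])    # bottom-up signal arriving at layer 1
td = numpy.dot(X[2] - O[2], W[1].T)  # top-down signal arriving at layer 1
print(bu.shape, td.shape)            # both come out as (25, 400)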

To see where the forms of these dot products originate, take a look at Equation 3 of the Montavon paper: for a three layer network, the .gibbs() method is doing the work behind updating y_d, z_d, x_m, y_m and z_m. Whenever one of those expressions has both a bottom up and a top down term, we’re at layer 1 (the middle layer). Whenever there is only one term, we’re either at layer 0, where there’s only a top down contribution, or at layer 2 (the top layer), where there’s only a bottom up contribution.

The final call in the method is X[l] = realize(sigm(bu + td + self.B[l])). This is exactly the form of the expressions in Equation 3, where X[0] = x, X[1] = y, and X[2] = z. This updates the list of arrays X. The effect of doing this is a little confusing as there are two different types of input arrays X that will be sent in — we’ll get to this shortly!
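Like sigm, the realize function was defined earlier in the script. Assuming it draws an independent binary sample for each entry, equal to 1 with probability given by that entry, a minimal sketch consistent with how it is used here would be:

import numpy

def realize(P):
    # For each entry p of P, return 1.0 with probability p and 0.0 otherwise.
    return (numpy.random.uniform(0, 1, P.shape) < P).astype('float32')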

The reparameterization methods

Next we’ll look at the two methods that handle the re-centering: reparamB adjusts the biases and reparamO adjusts the offset parameters. Here’s the code:

# --------------------------------------------
# Reparameterization
# --------------------------------------------
def reparamB(self, X, i):
     bu = numpy.dot((X[i-1]-self.O[i-1]), self.W[i-1]).mean(axis=0) if i > 0 else 0
     td = numpy.dot((X[i+1]-self.O[i+1]), self.W[i].T).mean(axis=0) if i+1 < len(X) else 0
     self.B[i] = (1-rr)*self.B[i] + rr*(self.B[i] + bu + td)

def reparamO(self, X, i):
     self.O[i] = (1-rr)*self.O[i] + rr*X[i].mean(axis=0)

As in the gibbs method, for both of these there are two input variables, where the second is a layer index, which he calls i here (instead of l). The structure of the first of these is very similar to the gibbs method. Note however that there is a .mean(axis=0) call for both bu and td. The reparamB method is doing the job in Equation 3 of modifying \vec{a}, \vec{b} and \vec{c} due to a non-zero \nu (the part proportional to \eta is dealt with separately). He writes it in a funny way but it’s just the stuff proportional to \nu in Eq. 3. The reason we need the .mean(axis=0) call is that the first input X can have an arbitrary number of rows. The .mean call averages over all these rows.

The reparamO method does the job of changing the offset parameters, which in Equation 3 are the vectors \vec{\alpha}, \vec{\beta} and \vec{\gamma}. The .mean() call averages over all of the rows of the input X.
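In symbols, both methods are damped updates with rate rr (an exponential moving average). Note in particular that the bias update simplifies:

B_i \leftarrow (1-rr)\,B_i + rr\,(B_i + bu + td) = B_i + rr\,(bu + td)

O_i \leftarrow (1-rr)\,O_i + rr\,\langle X_i \rangle

where \langle X_i \rangle is the average of X_i over its rows (that is what the .mean(axis=0) calls compute). So despite the funny-looking form, reparamB just nudges each bias by rr times the mean bottom-up plus top-down input, while reparamO nudges each offset toward the current mean activity of its layer.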

The last step — the learning step!

OK we’re now at the last part of this, where we can tie this up and some of the more mysterious parts of the previous steps will be clarified. Here’s the .learn() method:

# --------------------------------------------
# Learning step
# --------------------------------------------
def learn(self, Xd):

    # Initialize a data particle
    X = [realize(Xd)]+[self.X[l]*0+self.O[l] for l in range(1, len(self.X))]

    # Alternate gibbs sampler on data and free particles
    for l in (range(1, len(self.X), 2)+range(2, len(self.X), 2))*5:
        self.gibbs(X, l)
    for l in (range(1, len(self.X), 2)+range(0, len(self.X), 2))*1:
        self.gibbs(self.X, l)

    # Parameter update
    for i in range(0, len(self.W)):
        self.W[i] += lr*(numpy.dot((X[i]-self.O[i]).T, X[i+1]-self.O[i+1]) - numpy.dot((self.X[i]-self.O[i]).T, self.X[i+1]-self.O[i+1]))/len(Xd)
    for i in range(0, len(self.B)):
        self.B[i] += lr*(X[i]-self.X[i]).mean(axis=0)

    # Reparameterization
    for l in range(0, len(self.B)):
        self.reparamB(X, l)
    for l in range(0, len(self.O)):
        self.reparamO(X, l)

Alright, all the rest was just preliminaries; this is where the rubber hits the road!!! The .learn method takes one input (Xd). Going back to the Example of Execution section, we see that at the one place this method is called, Xd is an mb-row, 784-column array whose rows have been randomly selected from the entire training set.

The first bit is

# Initialize a data particle
X = [realize(Xd)]+[self.X[l]*0+self.O[l] for l in range(1, len(self.X))]

Recall that realize(Xd) generates a random number for each entry in Xd; if the random number is less than the entry, that entry is set to 1, otherwise it is set to 0. So [realize(Xd)] returns a list with a single entry, which is an mb×784 array of 0.0 and 1.0 values. To see what the second term is, recall that len(self.X) is the number of layers in the network; in this case that would be 3. So the second term is a list with two entries, self.X[1]*0+self.O[1] and self.X[2]*0+self.O[2], which are mb×400 and mb×100 arrays whose rows are all copies of the corresponding offsets. X is now a three element list, where the first element is the mb×784 data realization and the second and third elements hold the layer 1 and layer 2 offsets, one copy per row.

The second bit is

# Alternate gibbs sampler on data and free particles
for l in (range(1, len(self.X), 2)+range(2, len(self.X), 2))*5:
     self.gibbs(X, l)
for l in (range(1, len(self.X), 2)+range(0, len(self.X), 2))*1:
     self.gibbs(self.X, l)

The first loop, with len(self.X)=3, iterates over l=1 and l=2, five times over. The factor of five is the number of alternating Gibbs sampler sweeps we take to (approximately) equilibrate the network. Using X as input means that for this loop we are running the Gibbs sampler with the visible units fixed to the data values, the ones in the first entry of X.
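One small Python point: range(1, len(self.X), 2) + range(2, len(self.X), 2) only works because this is Python 2, where range returns a list that can be concatenated with +. Spelled out in a form that also runs under Python 3, the visit order for this clamped pass is:

n_layers = 3   # len(self.X) in the MNIST example
order = (list(range(1, n_layers, 2)) + list(range(2, n_layers, 2))) * 5
print(order)   # [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]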

Let’s go through this step by step. The first time it’s called, we have self.gibbs(X, 1). Because we start with layer 1, the gibbs method will compute both the bu and td quantities. Let’s see what they are! bu = numpy.dot(X[0]-self.O[0], self.W[0]). X[0] holds the mb×784 data realizations, self.O[0] holds the offsets on that bottom layer, and self.W[0] holds the weights connecting the visibles and the first hidden layer (these are initially zero). td = numpy.dot(X[2]-self.O[2], self.W[1].T). Every row of X[2] is currently equal to self.O[2], so X[2]-self.O[2] is all zeros and the initial td value is zero. We can also see what the self.X[l]*0 term in the initialization is doing: zeroing out a copy of self.X[l] and adding the offsets produces a fresh mb×m array, with one row per minibatch element, rather than handing gibbs the length-m offset vector itself.

After these, we get X[1] = realize(sigm(bu + td + self.B[1])). Both bu and td start out as arrays of zeros (the weights are initialized to zero), so effectively X[1] = realize(sigm(self.B[1])) broadcast across the minibatch: an mb×400 binary array of 1s and 0s, where 400 is the number of hiddens in the first layer.

Next self.gibbs(X, 2) is called. The same steps are repeated, except now td is set to zero (layer 2 is the topmost layer, so there is nothing above it) and bu = numpy.dot(X[1]-self.O[1], self.W[1]). X[1] is as found previously, self.O[1] are the offsets on the first hidden layer, and self.W[1] are the weights between the first and second hidden layers. X[2] = realize(sigm(bu + self.B[2])) is now set.

Then self.gibbs(X,1) is called again. Now X[2] has changed, and therefore the returned value of X[1] will change. Then self.gibbs(X,2) is called. Etc. This is repeated a total of five times.

Note that this procedure is carrying along mb different data objects and acting on them in parallel.

Now the next step performs the alternating Gibbs procedure on self.X. self.X is different from X. X holds the data we’re learning over, which changes each time we run a new minibatch, whereas self.X is a member variable of the DBM class and persists throughout the entire learning process. self.X holds the ‘fantasy particles’ that are supposed to represent the freely running network. self.X is initially [numpy.zeros([mb, m]).astype('float32') + o for m, o in zip(M, self.O)].

The loop iterates over l=1, then l=0, then l=2, and the *1 means we take just one pass through the alternating Gibbs sampler, this time without the visibles clamped (see the spelled-out order below).
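Spelled out the same way as before, the free-particle visit order is:

n_layers = 3
order = (list(range(1, n_layers, 2)) + list(range(0, n_layers, 2))) * 1
print(order)   # [1, 0, 2]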

Alright we’re almost there! The last bit of the method updates the weights, biases and offsets. Here’s the code:

# Parameter update
for i in range(0, len(self.W)):
     self.W[i] += lr*(numpy.dot((X[i]-self.O[i]).T, X[i+1]-self.O[i+1]) - numpy.dot((self.X[i]-self.O[i]).T, self.X[i+1]-self.O[i+1]))/len(Xd)
for i in range(0, len(self.B)):
     self.B[i] += lr*(X[i]-self.X[i]).mean(axis=0)

# Reparameterization
for l in range(0, len(self.B)):
     self.reparamB(X, l)
for l in range(0, len(self.O)):
     self.reparamO(X, l)

To see where these come from, go back to Equation 3 in the Montavon paper. Note that the x_d, y_d, z_d quantities are represented by X whereas x_m, y_m, z_m are represented by self.X. The division by len(Xd) normalizes over the mb objects. The .mean(axis=0) call averages over the mb rows.
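Written out, with X_i denoting the data particles, \tilde{X}_i = self.X[i] the free (fantasy) particles, and O_i the offsets, the two update loops implement

W_i \leftarrow W_i + \frac{lr}{mb} \left[ (X_i - O_i)^\top (X_{i+1} - O_{i+1}) - (\tilde{X}_i - O_i)^\top (\tilde{X}_{i+1} - O_{i+1}) \right]

B_i \leftarrow B_i + lr \, \langle X_i - \tilde{X}_i \rangle

where \langle \cdot \rangle again averages over the mb rows. These are the familiar positive-phase minus negative-phase Boltzmann machine gradient estimates, written in the centered (offset-subtracted) coordinates.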

Whew that was a lot of code review. But now I think we understand the entire thing!
