# Gregoire Montavon’s code for training a centered DBM

In the previous post, I showed a movie of how some of our DBM’s parameters changed over time during training. Now I’ll show you how I got those numbers. The core of the code was written by Gregoire Montavon, and can be found here. This code implements the ideas in the Deep Boltzmann Machines and the Centering Trick paper.

I’d originally intended to write the DBM training code from scratch, and I’m shortly going to do that anyway so we can run the training in hardware, but Gregoire’s code contains some really cool tricks that I’d like to go over. It took me a while to figure out how the code worked, because it’s so compact.

The code is deceptively short, so I’m just going to go through it section by section and provide some hopefully helpful editorial comments along the way.

Code Review 1. Front Matter

 # ----------------------------------------------------
# Training a Centered Deep Boltzmann Machine
# ----------------------------------------------------
#
#
# This code is released under the MIT licence:
#
# ----------------------------------------------------
#
# This code is based on the paper:
#
#   G. Montavon, K.-R. Mueller
#   Deep Boltzmann Machines and the Centering Trick
#   in Neural Networks: Tricks of the Trade, 2nd Edn
#   Springer LNCS, 2012
#
# ----------------------------------------------------
#
# This code is a basic implementation of the centered
# deep Boltzmann machines (without model averaging,
# minibatching and other optimization hacks). The code
# also requires the MNIST dataset that can be
#
# ----------------------------------------------------


If you want to run the code ‘out of the box’, you need to download and unpack the MNIST dataset, and remember where you put it.

Code Review 2. Imports, global parameters and helper functions


import numpy,Image

# ====================================================
# Global parameters
# ====================================================
lr      = 0.005     # learning rate
rr      = 0.005     # reparameterization rate
mb      = 25        # minibatch size
hlayers = [400,100] # size of hidden layers
biases  = [-1,-1]   # initial biases on hidden layers

# ====================================================
# Helper functions
# ====================================================
def arcsigm(x): return numpy.arctanh(2*x-1)*2
def sigm(x):    return (numpy.tanh(x/2)+1)/2
def realize(x): return (x &amp;gt; numpy.random.uniform(0,1,x.shape))*1.0
def render(x,name):
x = x - x.min() + 1e-9
x = x / (x.max() + 1e-9)
Image.fromarray((x*255).astype('byte'),'L').save(name)


Imports are straightforward, just numpy and Image. Image is required because the code visualizes some things as it proceeds. It does this by outputting images at regular intervals.

The global parameters include two rates. The first, lr, is the learning rate $\eta$ in the paper. This is the ‘regular’ learning rate in a DBM. The second, rr, is the reparametrization rate $\nu$ in the paper. This tells us how quickly to change the centering quantities $\alpha, \beta$ and $\gamma$. To see where $\eta$ and $\nu$ enter into the algorithm, see Figure 3 in the paper. I haven’t talked about the centering idea before, but it’s really simple — the concept is that each node in the network is reparametrized such that the variable representing it is shifted by an amount equal to its mean value, so that the new variable is ‘centered’ around zero. Doing this doesn’t change anything as it just shifts the entire energy function by a constant, but it makes the optimization much better behaved.

The next global is mb, which is the minibatch size. This sets the number of data objects the learning algorithm will consider for each cycle of training (that is, reconfiguring the biases, weights and offsets). In the past few posts, I’ve just been assuming that this quantity is equal to 1, i.e. Cid only gets to see one data object per training cycle, but it’s straightforward to instead send in a batch and run each training cycle over a batch instead of a single instance.

Next is hlayers, where the number of hidden units in each layer are specified — the leftmost number is the number in hidden layer 1, and the rightmost in hidden layer 2. For Cid this was [4, 4].

Last is biases. The way this implementation works, the first of these will be the initial bias shared by all hidden units in layer 1, and the second will be the initial bias shared by all hidden units in layer 2. So for the code above, $\vec{b} = [-1, -1, ..., -1]$ and $\vec{c} = [-1, -1, ..., -1]$ where $\vec{b}$ has 400 entries and $\vec{c}$ has 100 entries.

The helper functions contain two useful functions when working with Boltzmann distributions, arcsigm and sigm. They are written over tanh instead of exp for numerical stability reasons, but you can check that they are equivalent.

The realize(x) function is interesting. Here’s what it does. Assume you have an array X, where each element of X is a real number between 0 and 1 (these are probabilities in this context). If you call X0 = realize(X), what happens is that a random array with entries in the range (0,1) is created of the same size as X. Then for each entry in X, the corresponding random number is compared to that entry. If the random number is less than the entry, a zero is returned in the corresponding entry in X0, otherwise it’s set to 1. The returned array X0 contains only 0 and 1 elements, and is a sample from the probability distribution provided by X. This is a nice way of performing the B(p) operation — sampling from a Bernoulli distribution, where the input is an arbitrarily sized array.

The render function takes as input a two dimensional array x and a name. It then rescales the entries of the array to be within the range (0,1), and produces a greyscale image of these by first increasing the range to (0,255) by multiplying by 255, then converting floats to byte, and saving as greyscale (that’s the ‘L’ bit) as whatever name you fed it.

Code review 3. Example of execution

I’m going to do this out of order and leave the actual DBM class to last. Let’s take a look now at the “Example of execution” section.


# ====================================================
# Example of execution
# ====================================================

# Initialize MNIST dataset and centered DBM
X = (numpy.fromfile(open('train-images.idx3-ubyte','r'),dtype='ubyte',count=16+784*60000)[16:].reshape([60000,784])).astype('float32')/255.0
nn = DBM([784]+hlayers,[arcsigm(numpy.clip(X.mean(axis=0),0.01,0.99))]+biases)

for it in range(1000):

# Perform some learning steps
for _ in range(100): nn.learn(X[numpy.random.permutation(len(X))[:mb]])

# Output some debugging information
print((&quot;%03d |&quot; + &quot;%.3f &quot;*len(nn.W))%tuple([it]+[W.std() for W in nn.W]))
W = 1
for l in range(len(nn.W)):
W = numpy.dot(W,nn.W[l])
m = int(W.shape[1]**.5)
render(W.reshape([28,28,m,m]).transpose([2,0,3,1]).reshape([28*m,28*m]),'W%d.jpg'%(l+1));
render((nn.X[0]).reshape([mb,28,28]).transpose([1,0,2]).reshape([28,mb*28]),'X.jpg');


The first thing Gregoire does is to load in the training data into the X array. In his example, this data is the MNIST training set, which comprises a total of 60,000 28×28 pixel greyscale images of handwritten digits. Let’s unpack that first line a bit to see what’s going on.

The function numpy.fromfile() is interesting, I hadn’t seen this before but it looks very useful. Here’s the documentation for it. It takes three arguments in this case. The first is the file where your data lives (note: if you want to run this code, here is where you have to specify where the MNIST file lives). The second is the datatype of the returned array, which here is specified as ‘ubyte’ (unsigned byte). The third is the number of items to read. So the numpy.fromfile() in this case says read 16+784*60000 items from the MNIST file and return an array where each of these is in ubyte format. Note that 784 = 28×28, and the 16 is there because the file has some front matter junk.

After this is done, Gregoire asks for all entries in this returned array from the 16th onwards (that’s to exclude the first 16 entries comprising the non-data front matter). That’s the [16:] part. Then he asks for the array to be reshaped to have 60,000 rows and 784 columns (that’s the .reshape([60000, 784]) part). This makes it so that each row of the array contains a training example. He then recasts the array as type ‘float32’, and then divides every element by 255.0, so that each element of the array now lives in the range (0,1) (note that MNIST, being greyscale images, has elements defined in the range (0,255)). After all of this, we end up with an array X with 60,000 rows, 784 columns, with each entry a float32 in the range (0,1).

The next line is where he creates a DBM object. We haven’t looked at the DBM class yet, but let’s take a look at what the inputs look like here. There are two of them. The first is [784] + hlayers. What this does is it appends the entries in hlayers to 784 in a list. So the first input to DBM is a list comprising the number of units in each layer (in this case, there are 784 visible units, one per pixel in the MNIST images). So [784] + hlayers would be the list [784, 400, 100] in the out of the box example here.

The second input is more interesting. Recall that X is the array holding all available training data, where each row of X contains a single data element. The operation X.mean(axis=0) creates a new array holding the mean value across the entire training set on a pixel by pixel basis (axis=0 refers to the rows, axis=1 is the columns), so X.mean(axis=0) is a one row by 784 column array where each entry is the average value of that pixel.

numpy.clip() takes an input array and ‘clips’ it so that each entry is within a specified range. In this case, this is necessary because the average value of pixels in MNIST can be 0 or 1, because all of the images are surrounded by black borders and have white pixels near the middle. Having 0 or 1 values is not OK, because the next thing he does is take the arcsigm of these elements, and arcsigm(0) and arcsigm(1) are not defined.

The reason that he chooses to use these initial biases on the visible units is that, in the absence of the effects of the hidden units, this choice makes it so that the probabilities of each visible unit being on is equal to the mean value of the pixels over the training set — this is a reasonable place to start the learning process. The reason this is true is that the probability of an isolated node being on, given Boltzmann statistics, is $P(x=1) = sigm(a)$, where $a$ is the bias on the isolated node. If we then invert and solve for $a$ we get $a=arcsigm(P(x=1))$ (see the Montavon paper, Section 2 right above Equation 1). There is an interesting property of this data that’s being used here — rescaled greyscale images contain entries in the range (0,1). In this code, these entries are interchangeably used as scaled greyscale values in images and as probabilities.

So the second input to DBM is a list of length 784 comprising the average pixel values over the entire MNIST training set, clipped to within the range 0.01..0.99, to which is added the list biases, which is just a two element list. You may start to see what’s going on here — this second input to DBM will be the initial biases on each node in the DBM. The 784 element list is just the vector $\vec{a}$ (one entry per visible unit), while the second to last entry in the list will be the (shared) bias value for the entire vector $\vec{b}$, and the last entry will be the (shared) bias value for the entire vector $\vec{c}$.

Next Gregoire defines two loops. The outer loop is

for it in range(1000):


and the inner loop is

for _ in range(100): nn.learn(X[numpy.random.permutation(len(X))[:mb]])


Let’s start with the inner loop. nn is the object we initialized from the DBM class. This class has a method .learn(). Let’s take a look at what’s being fed to this method. Here len(X) is an integer — the number of rows in array X (which in this case is 60,000). The function numpy.random.permutation(), when fed an integer N, randomly permutes the sequence (0..N-1) and returns that permutation as a one dimensional array of integers. The [:mb] part only keeps the last mb elements of this array (recall mb is the ‘minibatch size’, whose default value in the code is 25, so this would be the last 25 elements of the randomly permuted set of integers up to 59,999). Let’s call this length mb array Y0. Then X[Y0] returns the rows of X corresponding to the entries in Y0. So the array that’s being fed to the .learn() method is an mb row, 784 column array where the rows have been randomly selected from the entire training set.

For each iteration of the outer loop, this inner loop is repeated 100 times (of course this is just a parameter so we could change this). Whatever .learn() is doing, it’s doing it 100 times in a row, each time taking as input a new set of randomly chosen mb images from the full training set.

Now back to the outer loop! After the inner loop is completed, Gregoire prints some information and saves some images to disk. Let’s see what he’s doing here. Here’s the bit we’ll look at now.

     # Output some debugging information
print((&quot;%03d |&quot; + &quot;%.3f &quot;*len(nn.W))%tuple([it]+[W.std() for W in nn.W]))
W = 1
for l in range(len(nn.W)):
W = numpy.dot(W,nn.W[l])
m = int(W.shape[1]**.5)
render(W.reshape([28,28,m,m]).transpose([2,0,3,1]).reshape([28*m,28*m]),'W%d.jpg'%(l+1));
render((nn.X[0]).reshape([mb,28,28]).transpose([1,0,2]).reshape([28,mb*28]),'X.jpg');


The first print command requires a property of the nn object called nn.W. I will jump ahead a bit here and tell you that nn.W is a list of arrays that store the weights in the Boltzmann machine (the numbers defined on the edges connecting nodes). The $k^{th}$ element of this list contains the weight array between the $k^{th}$ and $(k+1)^{th}$ layers. So for a three layer network, there are two elements to this list; nn.W[0] contains the weights between the visible (i.e. zeroth) layer and the first hidden layer, and nn.W[1] contains the weights between the first and second hidden layers.

The command [W.std() for W in nn.W] creates a list of floats. Each entry of this list is the standard deviation across all entries in the array W, for each W in nn.W. In our case, there are two entries in nn.W (nn.W[0] and nn.W[1]), so this would create a two-element list where the first element was the standard deviation of all the elements in nn.W[0] (i.e. the weights connecting the visible and first hidden layer), and the second element was the same but across all the elements of nn.W[1], the weights connecting the first and second hidden layers. The command tuple([it] + [W.std() for W in nn.W]) converts the list created by appending the stds to the iteration parameter it to a tuple, which in this case is a tuple over three elements (784, std(nn.W[0]), std(nn.W[1])). len(nn.W) is just the number of arrays in the list (in this case, two).

Alright so now we can see what this print command is doing. What it outputs is first the iteration number, followed by a separator |, then the standard deviation of the weights starting with the weights between the visible and first hidden layer, and then the first and second hidden layer.

Next there is a loop which is performed len(nn.W) times (which recall is the total number of layers minus one). The first time through the loop, l=0, and W is set to nn.W[0], the array containing the weights between the visible and first hidden layer. m is then set to the square root of the number of columns in nn.W[0]. This number will be the square root of the number of hidden units in the first layer (so square root of 400 -> 20). Given the weights and this number, the render() function is called.

The nn.W[0] array has size (784, 400). In order to plot it, Gregoire first resizes by calling .reshape([28,28,20,20]), and then permutes the axes of the reshaped array. The .transpose([2,0,3,1]) call returns the reshaped array with the second axis first, the first axis fourth, etc. Finally this new array is reshaped into size [28*20, 28*20], and this is what gets sent to render(). The output image name is ‘W0.jpg’. This shows a tiled set of weights, each plotted as two dimensional images.

I was very confused about what this was actually doing. I think I finally managed to work it out. When you call .reshape([28,28,20,20]) it creates a four dimensional array. Think of the first two indices as X, Y coordinates pointing to a pixel at (X, Y). Each of these corresponds to a visible unit. For each there is an array of size 20×20 made from a single row of the original nn.W[0] array. So at this stage we’ve got a convenient way of looking at all of the weights connected to each individual visible unit. For example, if we wanted to visualize all 400 weights connected to say visible unit #345, that would just be the 20×20 array nn.W[0].reshape([28,28,20,20])[12][9], because pixel #345 is in the 12th row and 9th column (12*28+9 = 345).

Now what Gregoire wants to plot is something different — not 784 images each corresponding to the weights connected to a single visible unit, but instead 400 images, each corresponding to the weights coming from a single hidden unit in the first hidden layer and connecting out to all 784 visible units. Let’s call the indices in the visible unit space X, Y and the indices in the hidden unit space x, y. Then nn.W[0].reshape([X,Y,x,y]).transpose([2,0,3,1]) gives an array of size (x,X,y,Y). Calling .reshape([x*X, y*Y]) on this gives an array of size (x*X, y*Y) which is then plotted. I drew up a little sketch of what’s going on for a simple case where there are 9 visibles and 4 hiddens — take a look!

The original array nn.W[0] is on the left. The visualization Gregoire uses takes each column of this array, resizes it so it’s 2D, and plots all of these together in a big square array — that’s the array on the top right.

Here’s what a W1.jpg image looks like.

A W1 visualization. Here you can see a 20×20 array of 28×28 images. Each image shows the weights coming from a single hidden unit in the first hidden layer. You can think of these as the patterns the visible units think about when that particular hidden unit is turned on, and the rest are turned off.

The second time through the loop, the array W changes to the dot product of the first array we just visualized with the second array between the first and second hidden layers. This makes the new array Gregoire is visualizing of size 784×100, and the exact same process happens to this array as before. This is interesting, as now what we’re visualizing are the effects of turning on a single unit in the second hidden layer, turning off all the others in that layer, and propagating its influence through the connections to the first hidden layer units, and then through the connections between the first hidden layer and the visible units.

A W2 visualization, showing a 10×10 array of 28×28 images. Now we are looking at the effects of turning on each of the 100 second layer hidden units one at a time, propagated down to the visible units.

Lastly, there is a render() call on another property of the nn object — nn.X[0]. This property of nn holds a set of mb samples from the visible units of a freely running network, and has size (mb, 784). If things are working properly, these should look like the types of examples we’re trying to help the network learn — they are the generated instances from the internal model that’s been learned at the stage of the learning process when they are rendered. The reason that there are mb of them is that the way the DBM code works, there are mb different networks kept throughout the training (these are sometimes called ‘fantasy particles’). Note that the reason there are mb of these is not clear — there is no real reason to choose that number except it seems to be the thing people usually do (mb is the number of data objects used per training cycle). Here’s an example from the out of the box code, where there are mb=25 fantasy particles.

Generated values at the 784 (28×28) visible units from a freely running network. If the training is working, these should look like the kind of inputs the network is learning over. These are like the dreams of 25 different networks.

OK folks, hope that was useful. Next time I’m going to go over the heart of the code — the DBM class!

## 7 thoughts on “Gregoire Montavon’s code for training a centered DBM”

• Hi JimmyD! Yes I’m familiar Theano — we work with some of its developers. Montavon’s code is just a basic implementation of a centered DBM, but it does have some neat tricks in it! What I’m trying to do is fundamentally understand what’s going on at this point, not do any optimization or use the best tools.

1. Very interesting. Thanks. Probable corection:
“The [:mb] part only keeps the last mb elements of this array”. In numpy indexing is [start:stop:stap]. So [:mb] means [0:mb:1] and are first mb elements of array.

• Ah yes you’re right. Although all this is for is to select mb random rows so it wouldn’t matter which end it picked them from!