Fusing sensor modalities using an LC-DBM

Our different senses differ in the details. The features that allow effective representation of audio signals are different from those for representing vision. More generally, we'd expect that any particular sensor type we provide to a new Creature we're designing would need its own specific way to effectively represent the types of data it receives from the world, and these representations would differ from sensor to sensor.

The different actuators in our Creature should also get different representations. For example, the motor signals that drive the bulk movement of the Creature (say its wheels) probably have a very different character than those that drive fine motor skills (such as the movement of fingers).

In the DBM framework, there is a natural way to handle this. The approach is called a Locally Connected Deep Boltzmann Machine, or LC-DBM. We briefly encountered this concept in an earlier post, where it made an appearance in this figure (the picture on the right).

Different kinds of Boltzmann Machines. The visible units are grey, the hidden units are white. Picture from this paper.

Let’s see if we can build an interesting LC-DBM that we can run in hardware.

Embodying Cid in a robot

Imagine we have a robot that has two motors. One of these controls the movement of the back left wheel, and one controls the movement of the back right wheel. The robot will have a castor in front that’s free to rotate, so we get back wheel drive. We’ll assume for this first experiment that these motors only have two settings — off (0) and forward (1). This is not all that restrictive. While there are some things this type of Creature can’t do, he will be able to get to most places using just these movements.

We’ll give each of these motors visible units corresponding to two successive times. Now that we’re starting to think about embodiments, what these times are in actual physical units becomes important. In this case, we’ll set the interval between times to be much longer than the response time of the motors — say 450 ms.

The DBM we’ll start with to represent the two motors at two times will look the same as the one we used in experiment #3. Here it is.

For the motor sector, we use two visible units corresponding to whether a motor is off (0) or moving forward (1) at two successive times; t and t+1 differ by 450 ms.

This Creature is also going to be equipped with a camera, so we can also have vision neurons. A typical camera that you'd mount on a robot provides a huge amount of information, but we're going to start off by using only a tiny fraction of it, and in a particularly dumb way. We'll take the images coming in from the camera and separate them into two regions: the left and right halves of the full image. We'll average all of the pixels in each half and threshold the result, so that an average intensity of 128 or higher means 1 (bright = 1) and anything lower means 0 (dark = 0). This mimics the thresholded photodetector ommatidia idea we discussed a couple of posts back, although now we have two of them, one for the left side of the Creature's vision and one for the right side.
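
To make this concrete, here is a minimal numpy sketch of that thresholding (the function name and the 0..255 greyscale frame convention are my assumptions for illustration, not code from the robot):

import numpy

def vision_neurons(frame):
     # frame: 2-D numpy array of greyscale pixel intensities in 0..255.
     # Returns (left, right), where 1 means bright and 0 means dark.
     half = frame.shape[1] // 2
     left = 1 if frame[:, :half].mean() >= 128 else 0
     right = 1 if frame[:, half:].mean() >= 128 else 0
     return left, right

# e.g. an all-white 480x640 frame gives (1, 1):
# vision_neurons(numpy.full([480, 640], 255.0)) == (1, 1)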

Again we'll have two successive times. Typical cameras provide around 30 frames per second, which is a lot faster than the time we set for the motor response. So what we'll do is average the camera results over each 450 ms interval (around 13-14 frames at 30 fps), so that the difference in time stays the same as the one we chose for the motors. Again this is not the smartest thing we could do but we can improve this later! With these choices, here's the DBM we will use for the vision system.

Note that the unit labels have been shifted by 4 relative to those of the motor system.

Now let's equip our Creature with a speaker / microphone. As with the vision system, an audio system we can mount on a robot can provide us with very rich data. But we'll ignore most of it for the time being. Analogously to the simple system we put in place for vision, let's again choose two audio neurons, but this time instead of thresholding the intensity of the visual input on the left/right halves of the incoming images, we'll threshold the intensity at two different frequencies, one low and one high, corresponding to 100 Hz and 1,000 Hz. An input will be 0 if the Fourier component of the signal over a total of 450 ms is less than a threshold, and 1 if it's greater. The idea is that if these frequencies are present, the corresponding audio neuron will be on, otherwise it will be off.
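
Here is a rough numpy sketch of that test (the helper and its threshold parameter are hypothetical and would need tuning on the real robot):

import numpy

def audio_neurons(samples, sample_rate, threshold):
     # samples: 1-D numpy array covering one 450 ms window of audio.
     # Returns (low, high): 1 if the 100 Hz / 1,000 Hz magnitude exceeds threshold.
     spectrum = numpy.abs(numpy.fft.rfft(samples))
     freqs = numpy.fft.rfftfreq(len(samples), d=1.0/sample_rate)
     low = spectrum[numpy.argmin(numpy.abs(freqs - 100.0))]
     high = spectrum[numpy.argmin(numpy.abs(freqs - 1000.0))]
     return (1 if low > threshold else 0, 1 if high > threshold else 0)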

Here’s the DBM for the audio system.

Here's the audio DBM. We've set it up so that there are two audio neurons, one of which fires if it senses 100 Hz and the other if it senses 1,000 Hz.

Finally, let’s add a weapons system. We’ll mount a missile launcher on the robot. Because firing a missile is serious business, we’ll make it so that both weapons neurons have to be on simultaneously for a firing event, so 00, 01 and 10 mean ‘don’t fire’, and 11 means ‘fire’. Again we’ll have two times, separated by 450 ms. Here’s the weapons system DBM.

For the weapons system, we fire only if the most significant bit (MSB) and least significant bit (LSB) are both 1, else we don't.

Connecting sensors and actuators at higher levels of the LC-DBM

OK so we have built four different DBMs for audio, vision, motor and weapons. But at the moment they are completely separate. Let’s fix that!

Here is an architecture that brings all four together, by combining the different modalities higher up in the LC-DBM.

An example of how we can connect up all four modalities. The orange level connects audio/weapons, audio/vision and motor/vision; the green level connects vision/audio/weapons and motor/vision/audio; and the top blue level connects all four.

This network can be embedded in hardware. I created this embedding by staring at it for a few minutes. There are probably much better ways to do it. But this one should work. Here it is!

This embedding uses a 4 by 6 block of unit cells, and should allow for all the connections in the network.

So that’s cool! Alright that’s enough for now, next time we’ll think about different experiments we can subject this New and Enhanced Cid to.

First ever DBM trained using a quantum computer

In Terminator 2, Arnold reveals that his CPU is a neural net processor, a learning computer. Of course it is! What else would it be? Interestingly, there are real neural net processors in the world. D-Wave makes the only superconducting version, but there are other types out there also. Today we’ll use one of our superconducting neural nets to re-run the three experiments we did last time.

I believe this is the first time quantum hardware has been used to train a DBM, although there have been some theoretical investigations.

Embedding into hardware

Recall that the network we were training in the previous post had one visible layer with up to four units, and two hidden layers each with four units. For what follows we’re going to associate each of these units with a specific qubit in a Vesuvius processor. The way we’re going to do this is to use a total of 16 qubits in two unit cells to represent the 12 units in the DBM.

All D-Wave processors can be thought of as hardware neural nets, where the qubits are the neurons and the physical couplers between pairs of qubits are the edges between those neurons. Specifically you should think of them as a type of Deep Boltzmann Machine (DBM), where specifying the biases and weights in a DBM is exactly like specifying the biases and coupling strengths in a D-Wave processor. As in a DBM, what you get out are samples from a probability distribution, which are the (binary) states of the DBM's units (both visible and hidden).

In the Vesuvius design, there is an 8×8 array of eight-qubit unit cells, for a total of 512 'neurons'. Each neuron is connected to at most 6 other neurons in Vesuvius. To do the experiments we want to do, we only need two of the 64 unit cells. For the experts out there, we could use the rest to do some interesting tricks to use more of the chip, such as gauge transformations and simple classical parallelism, but for now we'll just stick to the most basic implementation.

Here is a presentation containing some information about Vesuvius and its design. Take a look at slides 11-17 to get a high level overview of what’s going on.

Here is a picture of the DBM we set up in the last post.

Here we still have two neurons — one vision and one motor — but we have two different times (here labeled t and t+1).

Here is the embedding into hardware we’ll use. Hopefully this is clear! Each of the blue lines is a qubit. The horizontal qubits in unit cell #1 are strongly coupled to the horizontal qubits in unit cell #2 (represented by the red circles). We do this so that the variables in the first hidden layer can talk to all four variables in the second hidden layer (these are the four vertical qubits in unit cell #1) and all four visible units (these are the vertical qubits in unit cell #2).

The embedding into hardware we'll use here. We use two unit cells from the top left hand corner of the chip. The red circles indicate strong ferromagnetic coupling between the horizontal qubits in the two unit cells, which represent the four variables in the first hidden layer. The leftmost four vertical qubits represent the variables in the second hidden layer, while the rightmost four qubits represent the visible units.

Using the chip to replace the alternating Gibbs sampling step

Recall that the algorithm we used for training the DBM required drawing samples from two different distributions — the ‘freely running’ network, and a network with inputs clamped to values set by the data we are learning over. So now we have a hardware neural net. Can we do these two things directly?

The way the chip works is that we first program in a set of biases and weights, and then draw a bunch of samples from the probability distribution they create. So we should be able to do this by following a very simple prescription — do everything exactly the same as before, except replace the alternating Gibbs sampler with samples drawn from the hardware with its machine language parameters set to the current bias, offset and weight values.

The only tricky part of this (and it’s not really all that tricky) is to create the map between the biases, weights and offsets in the software model to the biases and weights in the hardware.
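
The essence of that map is the standard change of variables between 0/1 units and ±1 spins. Here is a sketch under the assumption that the DBM energy is E(x) = -\sum_i b_i x_i - \sum_{ij} W_{ij} x_i x_j over x_i \in \{0,1\} and the hardware minimizes an Ising energy E(s) = \sum_i h_i s_i + \sum_{ij} J_{ij} s_i s_j over s_i \in \{-1,+1\}; substituting x_i = (1+s_i)/2 and dropping the constant gives the conversion below. The dict representation is just for illustration (it's not the actual machine-language interface), and I'm ignoring the overall energy scale, which also has to be calibrated:

def to_ising(b, W):
     # b: {unit: bias}, W: {(unit, unit): weight}, defined over 0/1 units.
     # Returns (h, J) for the equivalent problem over -1/+1 spins.
     J = {(i, j): -w/4.0 for (i, j), w in W.items()}
     h = {i: -bias/2.0 for i, bias in b.items()}
     for (i, j), w in W.items():
          h[i] -= w/4.0
          h[j] -= w/4.0
     return h, J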

Experimental results: Running a real quantum brain

Here are the results of doing this for the three experiments we set up last time, but now comparing training the DBM using alternating Gibbs sampling in software to training the DBM by drawing samples from a Vesuvius 6 chip. The parameters of the run were 100 problems per minibatch, 100 epochs, 1000 learning steps per epoch, learning rate = 0.005 and reparametrization rate = 0 (I set it to zero just to make everything simpler for debugging — we could make it non-zero if we want).

Comparing Alternating Gibbs Sampling in software (blue) to drawing samples from Vesuvius (red). Both do great!

Same comparison, but for Experiment #2. Here we see something very interesting — the quantum version learns faster and gets a lot smarter!

Same but for experiment #3. Again the quantum version learns faster and gets smarter.

This is just so freaking cool.

A recap

So for the first time ever, a quantum computer has been used to train a DBM. We did this for three different experiments, and plotted the S_0 number as a function of epoch for 100 epochs. We compared the results of the DBM training on a Vesuvius chip to the same results using the standard alternating Gibbs sampling approach, and found that for experiments 2 and 3 the quantum version trained faster and obtained better scores.

This better performance is due to the replacement of the approximate AGS step with the correct sampling from the full probability distribution obtained from using Vesuvius.

Montavon’s code — the DBM class

Last time we did a review of Gregoire Montavon’s code for training DBMs, but left the most important part for last. Let’s take a look at the last bit. This is the heart of the code.

The DBM class and its member variables

Here is the front matter and the initialization of the member variables for the class.

# ====================================================
# Centered deep Boltzmann machine
# ----------------------------------------------------
# - self.W: list of weight matrices between layers
# - self.B: list of bias associated to each unit
# - self.O: list of offsets associated to each unit
# - self.X: free particles tracking model statistics
# ====================================================

class DBM:
     # --------------------------------------------
     # Initialize model parameters and particles
     # --------------------------------------------
     def __init__(self, M, B):
          self.W = [numpy.zeros([m, n]).astype('float32') for m, n in zip(M[:-1], M[1:])]
          self.B = [numpy.zeros([m]).astype('float32') + b for m, b in zip(M, B)]
          self.O = [sigm(b) for b in self.B]
          self.X = [numpy.zeros([mb, m]).astype('float32') + o for m, o in zip(M, self.O)]

Last time we saw that the inputs to the class were a list of the number of variables in each layer (that’s M above — in the case we looked at, M was the list [784, 400, 100]), and a list of biases on the variables (that’s B above). The list of biases has three entries. The first is an array containing all individual biases on the visible units. The second is a single number which is the common bias on all hidden layer 1 units. The third is a single number which is the common bias on all hidden layer 2 units. So B is a list of length 3 in the case we looked at last.

Now let’s take a look at how the member variables are initialized. First some pythonic notation. M[:-1] means the list M without its last entry. So for the MNIST example M[:-1] = [784, 400]. M[1:] means the list without the zeroth entry — that would be M[1:] = [400, 100]. The function zip(A, B) acts like a zipper — it returns a list of tuples, where each tuple contains the ith element of A and the ith element of B. For example, zip([1,2,3], [4,5,6]) = [(1,4), (2,5), (3,6)]. So zip(M[:-1], M[1:]) returns a list of tuples, each of which contains the ith and (i+1)th number of variables in the deep net. For example, if we had a deep net containing 5 layers, with M = [784, 400, 100, 50, 10], then M[:-1] = [784, 400, 100, 50] and M[1:] = [400, 100, 50, 10], and zip(M[:-1], M[1:]) = [ (784, 400), (400, 100), (100, 50), (50, 10)]. Isn’t that clever? I love stuff like this.

Alright so now we can see what the first member variable looks like. self.W = [numpy.zeros([m, n]).astype('float32') for m, n in zip(M[:-1], M[1:])] creates a list of arrays. Each array is initially filled with zeros and will hold the connection weights we are trying to learn. There is one array for each pair of connected layers in the network. Each has size [m, n] where m is the number of variables in the lower level and n is the number in the upper level. In the case where M = [784, 400, 100] this means that self.W is a list with two entries, the first being an array of size [784, 400] and the second being an array of size [400, 100]. The data type of the weights is set to float32.

Next let's take a look at the bias term. Here we have self.B = [numpy.zeros([m]).astype('float32') + b for m, b in zip(M, B)]. The zip command creates a set of tuples. The first of these is (784, array_containing_biases_of_length_784), the second is (400, -1) and the third is (100, -1). The initialization command therefore creates a list with 3 entries. The first is a length-784 array initialized with the biases on the visible units; the second a length-400 array with each element equal to -1; the third a length-100 array with each element equal to -1. This is also clever, as the same command works on both array-style and integer entries in B. If you wanted to create individualized initial biases on the hidden layers, the exact same command should work, but instead of an integer (like -1) you feed it an array of the correct dimension.

OK so after this is done, self.B is a list with 3 entries. The zeroth entry is an array containing the biases on the visible units. The first entry is an array containing the biases on the first hidden layer units. The second entry is an array containing the biases on the second hidden layer units.

Next we initialize the member variable that will hold the offsets. These offsets are used to ‘center’ the DBM, which is useful for making the optimization better behaved. This one is pretty simple. self.O = [sigm(b) for b in self.B] creates a list containing three elements. The first applies the sigm function to all entries in the first entry of self.B; the second does the same to the second entry of self.B; and the third does the same to the third entry of self.B. The reason the sigm() function is used is that the offsets are initially chosen to represent the mean value of the variable, given the bias at that variable and assuming all the weights connected to it are zero. The sigm() function does that for us (assuming Boltzmann statistics).

Lastly we initialize the self.X member variable. The zip(M, self.O) command creates a list of three tuples of the form [(784, array_containing_offsets_on_visibles), (400, array_containing_offsets_on_first_hidden_layer), (100, array_containing_offsets_on_second_hidden_layer)].

When self.X = [numpy.zeros([mb, m]).astype('float32') + o for m, o in zip(M, self.O)] is called, it creates a list with three entries. These have sizes [mb, 784], [mb, 400] and [mb, 100], where recall mb is a global variable specifying the minibatch size (the number of data examples used in each cycle of learning). In each of these, each row is initialized to the offset numbers in the corresponding entry of self.O — each of the mb rows in each of the 3 arrays in self.X is initially identical.

Gibbs sampling

The first method defined in the class is .gibbs(X, l). Let’s check it out.

# --------------------------------------------
# Gibbs activation of a layer
# --------------------------------------------
def gibbs(self, X, l):
     bu = numpy.dot(X[l-1]-self.O[l-1], self.W[l-1]) if l > 0 else 0
     td = numpy.dot(X[l+1]-self.O[l+1], self.W[l].T) if l+1 < len(X) else 0
     X[l] = realize(sigm(bu + td + self.B[l]))

The method takes two input arguments. The second of these is the layer of the network (that’s l, which is an integer), where l=0 is the visible layer, l=1 is the first hidden layer, l=2 is the second hidden layer, etc.

Inside the method you can see two quantities being defined — these stand for ‘bottom up’ and ‘top down’. These are conditionally defined. If you take a look at the first one (bu), you can see that bu=0 if l=0, otherwise it’s the specified dot product. This means that if l=0 (the visible layer), there is no layer below it so the ‘bottom up’ contribution is zero. Similarly for the top down quantity, if we are in the topmost layer of the network, there is no layer above that, so td is set to zero, otherwise it’s the dot product.

The first input argument to the method (X) is a list of arrays. We can see from the structure of the method that this list should have the same number of elements as self.O — that is, there should be the same number of elements as the number of layers in the network. There is a very clever trick that’s being played here about what the actual size of the arrays inside the list X are. Note that the command X[l-1]-self.O[l-1] is sensible as long as the number of columns in X[l-1] matches the length of self.O[l-1] — X[l-1] can have an arbitrary number of rows and this will still work. If there is more than one row in X[l-1] then each row gets self.O[l-1] subtracted off. We’ll see how this flexibility is used to good effect shortly!

Alright so let’s look at the dot product defining bu. We have bu = numpy.dot(X[l-1]-self.O[l-1], self.W[l-1]) if l > 0 else 0. So let’s assume l>0. What’s the dot product doing?

The first term has the same shape as X[l-1]. The weight array self.W[l-1] holds the weights immediately below layer l. If each row of X[l-1] holds the states of the variables in layer l-1, and self.O[l-1] holds the offsets on those, then (X-O)*W gives the effect of the layer l-1 variables felt by the variables in layer l — the 'bottom up' effect.

Similarly the ‘top down’ quantity does the exact same thing, but propagates the influence of the layer above down to layer l.

To see where the forms of these dot products originate, take a look at Equation 3 of the Montavon paper — for a three layer network, the .gibbs() method is doing the work behind updating y_d, z_d, x_m, y_m and z_m. Whenever you see two entries in these, we're at layer 1 and there are both top down and bottom up contributions. Whenever there is only one, we're either at layer 0 and there's only a top down, or at layer 2 (the top layer) and there's only a bottom up.

The final call in the method is X[l] = realize(sigm(bu + td + self.B[l])). This is exactly the form of the expressions in Equation 3, where X[0] = x, X[1] = y, and X[2] = z. This updates the list of arrays X. The effect of doing this is a little confusing as there are two different types of input arrays X that will be sent in — we’ll get to this shortly!

The reparametrization method

Next we’ll look at the methods for changing the offset parameters. Here’s the code:

# --------------------------------------------
# Reparameterization
# --------------------------------------------
def reparamB(self, X, i):
     bu = numpy.dot((X[i-1]-self.O[i-1]), self.W[i-1]).mean(axis=0) if i > 0 else 0
     td = numpy.dot((X[i+1]-self.O[i+1]), self.W[i].T).mean(axis=0) if i+1 < len(X) else 0
     self.B[i] = (1-rr)*self.B[i] + rr*(self.B[i] + bu + td)

def reparamO(self, X, i):
     self.O[i] = (1-rr)*self.O[i] + rr*X[i].mean(axis=0)

As in the gibbs method, for both of these there are two input variables, where the second is a layer index, which he calls i here (instead of l). The structure of the first of these is very similar to the gibbs method. Note however that there is a .mean(axis=0) call for both bu and td. The reparamB method is doing the job in Equation 3 of modifying \vec{a}, \vec{b} and \vec{c} due to a non-zero \nu (the part proportional to \eta is dealt with separately). He writes it in a funny way but it's just the stuff proportional to \nu in Eq. 3. The reason we need the .mean(axis=0) call is that the first input X can have an arbitrary number of rows. The .mean call averages over all these rows.

The reparamO method does the job of changing the offset parameters, which in Equation 3 are the vectors \vec{\alpha}, \vec{\beta} and \vec{\gamma}. The .mean() call averages over all of the rows of the input X.

The last step — the learning step!

OK we’re now at the last part of this, where we can tie this up and some of the more mysterious parts of the previous steps will be clarified. Here’s the .learn() method:

# --------------------------------------------
# Learning step
# --------------------------------------------
def learn(self, Xd):

     # Initialize a data particle
     X = [realize(Xd)]+[self.X[l]*0+self.O[l] for l in range(1, len(self.X))]

     # Alternate gibbs sampler on data and free particles
     for l in (range(1, len(self.X), 2)+range(2, len(self.X), 2))*5:
          self.gibbs(X, l)
     for l in (range(1, len(self.X), 2)+range(0, len(self.X), 2))*1:
          self.gibbs(self.X, l)

     # Parameter update
     for i in range(0, len(self.W)):
          self.W[i] += lr*(numpy.dot((X[i]-self.O[i]).T, X[i+1]-self.O[i+1]) - numpy.dot((self.X[i]-self.O[i]).T, self.X[i+1]-self.O[i+1]))/len(Xd)
     for i in range(0, len(self.B)):
          self.B[i] += lr*(X[i]-self.X[i]).mean(axis=0)

     # Reparameterization
     for l in range(0, len(self.B)):
          self.reparamB(X, l)
     for l in range(0, len(self.O)):
          self.reparamO(X, l)

Alright, all the rest was just preliminaries — this is where the rubber hits the road!!! The .learn method takes one input (Xd). Going back to the Example of Execution section, we see that in the one place this method is called, Xd is an mb-row, 784-column array where the rows have been randomly selected from the entire training set.

The first bit is

# Initialize a data particle
X = [realize(Xd)]+[self.X[l]*0+self.O[l] for l in range(1, len(self.X))]

Recall that realize(Xd) generates a random number for each entry in Xd; if the random number is less than the entry, a 1 is returned for that entry, otherwise it's set to 0. So [realize(Xd)] returns a list with a single entry, which is an mb×784 array of 0.0 and 1.0 values. To see what the second term is, recall that len(self.X) is the number of layers in the network — in this case that would be 3. So this second term has two entries: self.X[1]*0+self.O[1] and self.X[2]*0+self.O[2]. X is now a three element list, where the first element is the mb×784 data array, and the second and third are mb-row arrays whose rows are all equal to self.O[1] and self.O[2] respectively.

The second bit is

# Alternate gibbs sampler on data and free particles
for l in (range(1, len(self.X), 2)+range(2, len(self.X), 2))*5:
     self.gibbs(X, l)
for l in (range(1, len(self.X), 2)+range(0, len(self.X), 2))*1:
     self.gibbs(self.X, l)

The first loop, with len(self.X)=3, iterates over l=1 and l=2 five times. The factor of five is the number of alternating Gibbs sampling steps we take to equilibrate the network. Using X as input means that for this loop, we are running the Gibbs sampler with the visible units fixed to the data values — the ones in the first entry of X.

Let's go through this step by step. The first time it's called, we have self.gibbs(X, 1). Because we start with layer 1, the gibbs method will compute both bu and td quantities. Let's see what they are! bu = numpy.dot(X[0]-self.O[0], self.W[0]). X[0] holds the mb×784 data, and self.O[0] are the offsets in that bottom layer. self.W[0] are the weights connecting the visibles and the first hidden layer (these are initially zero). td = numpy.dot(X[2]-self.O[2], self.W[1].T). Here every row of X[2] equals self.O[2], so the initial td value is zero. We see also why he added the self.X[l]*0 term — I think if it wasn't there the dot product dimensionality would be off.

After these, we get X[1] = realize(sigm(bu + td + self.B[1])). Both bu and td are initially zero, so X[1]=realize(sigm(self.B[1])), which assigns 1s and 0s to create an mb×400 binary array (400 being the number of hiddens in the first layer).

Next self.gibbs(X, 2) is called. The same steps are repeated, except now td is zero and bu = numpy.dot(X[1]-self.O[1], self.W[1]). X[1] is as found previously, self.O[1] are the offsets on the first hidden layer, and self.W[1] are the weights between the second and first hidden layers. X[2] = realize(sigm(bu + self.B[2])) is now set.

Then self.gibbs(X,1) is called again. Now X[2] has changed, and therefore the returned value of X[1] will change. Then self.gibbs(X,2) is called. Etc. This is repeated a total of five times.

Note that this procedure is carrying along mb different data objects and acting on them in parallel.

Now the next step performs the alternating Gibbs procedure on self.X. self.X is different from X. X holds the data we're learning over, which changes each time we run a new minibatch, whereas self.X is a member variable of the DBM class and persists throughout the entire learning process. self.X holds the 'fantasy particles' that are supposed to represent the freely running network. self.X is initially [numpy.zeros([mb, m]).astype('float32') + o for m, o in zip(M, self.O)].

The loop iterates over l=1, then l=0, then l=2 — so the default is to just do one pass through the alternating Gibbs sampler without the visibles clamped.

Alright we’re almost there! The last bit of the method updates the weights, biases and offsets. Here’s the code:

# Parameter update
for i in range(0, len(self.W)):
     self.W[i] += lr*(numpy.dot((X[i]-self.O[i]).T, X[i+1]-self.O[i+1]) - numpy.dot((self.X[i]-self.O[i]).T, self.X[i+1]-self.O[i+1]))/len(Xd)
for i in range(0, len(self.B)):
     self.B[i] += lr*(X[i]-self.X[i]).mean(axis=0)

# Reparameterization
for l in range(0, len(self.B)):
     self.reparamB(X, l)
for l in range(0, len(self.O)):
     self.reparamO(X, l)

To see where these come from, go back to Equation 3 in the Montavon paper. Note that the x_d, y_d, z_d quantities are represented by X whereas x_m, y_m, z_m are represented by self.X. The division by len(Xd) normalizes over the mb objects. The .mean(axis=0) call averages over the mb rows.

Whew that was a lot of code review. But now I think we understand the entire thing!

Gregoire Montavon’s code for training a centered DBM

In the previous post, I showed a movie of how some of our DBM’s parameters changed over time during training. Now I’ll show you how I got those numbers. The core of the code was written by Gregoire Montavon, and can be found here. This code implements the ideas in the Deep Boltzmann Machines and the Centering Trick paper.

I’d originally intended to write the DBM training code from scratch, and I’m shortly going to do that anyway so we can run the training in hardware, but Gregoire’s code contains some really cool tricks that I’d like to go over. It took me a while to figure out how the code worked, because it’s so compact.

The code is deceptively short, so I’m just going to go through it section by section and provide some hopefully helpful editorial comments along the way.

Code Review 1. Front Matter

 # ----------------------------------------------------
 # Training a Centered Deep Boltzmann Machine
 # ----------------------------------------------------
 #
 # Copyright: Gregoire Montavon
 #
 # This code is released under the MIT licence:
 # http://www.opensource.org/licenses/mit-license.html
 #
 # ----------------------------------------------------
 #
 # This code is based on the paper:
 #
 #   G. Montavon, K.-R. Mueller
 #   Deep Boltzmann Machines and the Centering Trick
 #   in Neural Networks: Tricks of the Trade, 2nd Edn
 #   Springer LNCS, 2012
 #
 # ----------------------------------------------------
 #
 # This code is a basic implementation of the centered
 # deep Boltzmann machines (without model averaging,
 # minibatching and other optimization hacks). The code
 # also requires the MNIST dataset that can be
 # downloaded at http://yann.lecun.com/exdb/mnist/.
 #
 # ----------------------------------------------------

If you want to run the code ‘out of the box’, you need to download and unpack the MNIST dataset, and remember where you put it.

Code Review 2. Imports, global parameters and helper functions


import numpy,Image

# ====================================================
# Global parameters
# ====================================================
lr      = 0.005     # learning rate
rr      = 0.005     # reparameterization rate
mb      = 25        # minibatch size
hlayers = [400,100] # size of hidden layers
biases  = [-1,-1]   # initial biases on hidden layers

# ====================================================
# Helper functions
# ====================================================
def arcsigm(x): return numpy.arctanh(2*x-1)*2
def sigm(x):    return (numpy.tanh(x/2)+1)/2
def realize(x): return (x > numpy.random.uniform(0,1,x.shape))*1.0
def render(x,name):
     x = x - x.min() + 1e-9
     x = x / (x.max() + 1e-9)
     Image.fromarray((x*255).astype('byte'),'L').save(name)

Imports are straightforward, just numpy and Image (the latter comes from PIL; this is Python 2 code). Image is required because the code visualizes some things as it proceeds. It does this by outputting images at regular intervals.

The global parameters include two rates. The first, lr, is the learning rate \eta in the paper. This is the 'regular' learning rate in a DBM. The second, rr, is the reparametrization rate \nu in the paper. This tells us how quickly to change the centering quantities \alpha, \beta and \gamma. To see where \eta and \nu enter into the algorithm, see Figure 3 in the paper. I haven't talked about the centering idea before, but it's really simple — the concept is that each node in the network is reparametrized such that the variable representing it is shifted by an amount equal to its mean value, so that the new variable is 'centered' around zero. Doing this doesn't change the model (the difference can be absorbed into the biases and a constant shift of the energy), but it makes the optimization much better behaved.
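
Schematically, for a single pair of connected layers with 0/1 units x and y and offsets \alpha and \beta, centering replaces the coupling term -x^T W y in the energy with -(x - \alpha)^T W (y - \beta). Expanding the product shows the difference is a linear term plus a constant, which is exactly what can be absorbed into the biases. (This is a schematic two-layer form; see the paper for the exact three-layer expressions.)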

The next global is mb, which is the minibatch size. This sets the number of data objects the learning algorithm will consider for each cycle of training (that is, reconfiguring the biases, weights and offsets). In the past few posts, I’ve just been assuming that this quantity is equal to 1, i.e. Cid only gets to see one data object per training cycle, but it’s straightforward to instead send in a batch and run each training cycle over a batch instead of a single instance.

Next is hlayers, where the number of hidden units in each layer are specified — the leftmost number is the number in hidden layer 1, and the rightmost in hidden layer 2. For Cid this was [4, 4].

Last is biases. The way this implementation works, the first of these will be the initial bias shared by all hidden units in layer 1, and the second will be the initial bias shared by all hidden units in layer 2. So for the code above, \vec{b} = [-1, -1, ..., -1] and \vec{c} = [-1, -1, ..., -1] where \vec{b} has 400 entries and \vec{c} has 100 entries.

The helper functions contain two useful functions when working with Boltzmann distributions, arcsigm and sigm. They are written over tanh instead of exp for numerical stability reasons, but you can check that they are equivalent.
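
In case you want to check the equivalence: tanh(x/2) = (1 - e^{-x})/(1 + e^{-x}), so sigm(x) = (tanh(x/2) + 1)/2 = 1/(1 + e^{-x}), which is the standard logistic function, and arcsigm is just its inverse (written through arctanh).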

The realize(x) function is interesting. Here's what it does. Assume you have an array X, where each element of X is a real number between 0 and 1 (these are probabilities in this context). If you call X0 = realize(X), what happens is that a random array with entries in the range (0,1) is created of the same size as X. Then for each entry in X, the corresponding random number is compared to that entry. If the random number is less than the entry, a one is returned in the corresponding entry in X0, otherwise it's set to zero. The returned array X0 contains only 0 and 1 elements, and is a sample from the probability distribution provided by X. This is a nice way of performing the B(p) operation — sampling from a Bernoulli distribution, where the input is an arbitrarily sized array.

The render function takes as input a two dimensional array x and a name. It then rescales the entries of the array to be within the range (0,1), and produces a greyscale image of these by first increasing the range to (0,255) by multiplying by 255, then converting floats to byte, and saving as greyscale (that’s the ‘L’ bit) as whatever name you fed it.

Code Review 3. Example of execution

I’m going to do this out of order and leave the actual DBM class to last. Let’s take a look now at the “Example of execution” section.


# ====================================================
# Example of execution
# ====================================================

# Initialize MNIST dataset and centered DBM
X = (numpy.fromfile(open('train-images.idx3-ubyte','r'),dtype='ubyte',count=16+784*60000)[16:].reshape([60000,784])).astype('float32')/255.0
nn = DBM([784]+hlayers,[arcsigm(numpy.clip(X.mean(axis=0),0.01,0.99))]+biases)

for it in range(1000):

     # Perform some learning steps
     for _ in range(100): nn.learn(X[numpy.random.permutation(len(X))[:mb]])

     # Output some debugging information
     print(("%03d |" + "%.3f "*len(nn.W))%tuple([it]+[W.std() for W in nn.W]))
     W = 1
     for l in range(len(nn.W)):
          W = numpy.dot(W,nn.W[l])
          m = int(W.shape[1]**.5)
          render(W.reshape([28,28,m,m]).transpose([2,0,3,1]).reshape([28*m,28*m]),'W%d.jpg'%(l+1));
     render((nn.X[0]).reshape([mb,28,28]).transpose([1,0,2]).reshape([28,mb*28]),'X.jpg');

The first thing Gregoire does is to load in the training data into the X array. In his example, this data is the MNIST training set, which comprises a total of 60,000 28×28 pixel greyscale images of handwritten digits. Let’s unpack that first line a bit to see what’s going on.

The function numpy.fromfile() is interesting, I hadn't seen this before but it looks very useful. Here's the documentation for it. It takes three arguments in this case. The first is the file where your data lives (note: if you want to run this code, here is where you have to specify where the MNIST file lives). The second is the datatype of the returned array, which here is specified as 'ubyte' (unsigned byte). The third is the number of items to read. So the numpy.fromfile() in this case says read 16+784*60000 items from the MNIST file and return an array where each of these is in ubyte format. Note that 784 = 28×28, and the 16 is there because the file begins with a 16-byte header (a magic number and the array dimensions) rather than pixel data.

After this is done, Gregoire asks for all entries in this returned array from the 16th onwards (that's to exclude the first 16 entries comprising the non-data front matter). That's the [16:] part. Then he asks for the array to be reshaped to have 60,000 rows and 784 columns (that's the .reshape([60000, 784]) part). This makes it so that each row of the array contains a training example. He then recasts the array as type 'float32', and then divides every element by 255.0, so that each element of the array now lives in the range (0,1) (note that MNIST, being greyscale images, has elements defined in the range (0,255)). After all of this, we end up with an array X with 60,000 rows, 784 columns, with each entry a float32 in the range (0,1).

The next line is where he creates a DBM object. We haven’t looked at the DBM class yet, but let’s take a look at what the inputs look like here. There are two of them. The first is [784] + hlayers. What this does is it appends the entries in hlayers to 784 in a list. So the first input to DBM is a list comprising the number of units in each layer (in this case, there are 784 visible units, one per pixel in the MNIST images). So [784] + hlayers would be the list [784, 400, 100] in the out of the box example here.

The second input is more interesting. Recall that X is the array holding all available training data, where each row of X contains a single data element. The operation X.mean(axis=0) creates a new array holding the mean value across the entire training set on a pixel by pixel basis (axis=0 refers to the rows, axis=1 is the columns), so X.mean(axis=0) is a one row by 784 column array where each entry is the average value of that pixel.

numpy.clip() takes an input array and 'clips' it so that each entry is within a specified range. In this case this is necessary because the average value of a pixel over the MNIST training set can be 0 or 1 (all of the images are surrounded by black borders and have white pixels near the middle). Having 0 or 1 values is not OK, because the next thing he does is take the arcsigm of these elements, and arcsigm(0) and arcsigm(1) are not defined.

The reason that he chooses to use these initial biases on the visible units is that, in the absence of the effects of the hidden units, this choice makes it so that the probabilities of each visible unit being on is equal to the mean value of the pixels over the training set — this is a reasonable place to start the learning process. The reason this is true is that the probability of an isolated node being on, given Boltzmann statistics, is P(x=1) = sigm(a), where a is the bias on the isolated node. If we then invert and solve for a we get a=arcsigm(P(x=1)) (see the Montavon paper, Section 2 right above Equation 1). There is an interesting property of this data that’s being used here — rescaled greyscale images contain entries in the range (0,1). In this code, these entries are interchangeably used as scaled greyscale values in images and as probabilities.
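
A quick sanity check of that inversion (self-contained; these are the same helper definitions as above):

import numpy

def arcsigm(x): return numpy.arctanh(2*x - 1)*2
def sigm(x):    return (numpy.tanh(x/2) + 1)/2

p = numpy.array([0.01, 0.3, 0.99])
print(numpy.allclose(sigm(arcsigm(p)), p))   # True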

So the second input to DBM is a list of length 784 comprising the average pixel values over the entire MNIST training set, clipped to within the range 0.01..0.99, to which is added the list biases, which is just a two element list. You may start to see what’s going on here — this second input to DBM will be the initial biases on each node in the DBM. The 784 element list is just the vector \vec{a} (one entry per visible unit), while the second to last entry in the list will be the (shared) bias value for the entire vector \vec{b}, and the last entry will be the (shared) bias value for the entire vector \vec{c}.

Next Gregoire defines two loops. The outer loop is

for it in range(1000):

and the inner loop is

for _ in range(100): nn.learn(X[numpy.random.permutation(len(X))[:mb]])

Let's start with the inner loop. nn is the object we initialized from the DBM class. This class has a method .learn(). Let's take a look at what's being fed to this method. Here len(X) is an integer — the number of rows in array X (which in this case is 60,000). The function numpy.random.permutation(), when fed an integer N, randomly permutes the sequence (0..N-1) and returns that permutation as a one dimensional array of integers. The [:mb] part keeps only the first mb elements of this array (recall mb is the 'minibatch size', whose default value in the code is 25, so this would be the first 25 elements of the randomly permuted set of integers up to 59,999). Let's call this length-mb array Y0. Then X[Y0] returns the rows of X corresponding to the entries in Y0. So the array that's being fed to the .learn() method is an mb-row, 784-column array where the rows have been randomly selected from the entire training set.
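
A tiny illustration of that indexing (the permutation is random, so the values shown are just one possible draw):

import numpy

idx = numpy.random.permutation(6)   # one possible result: array([3, 0, 5, 1, 4, 2])
print(idx[:3])                      # the first 3 entries, e.g. [3 0 5]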

For each iteration of the outer loop, this inner loop is repeated 100 times (of course this is just a parameter so we could change this). Whatever .learn() is doing, it’s doing it 100 times in a row, each time taking as input a new set of randomly chosen mb images from the full training set.

Now back to the outer loop! After the inner loop is completed, Gregoire prints some information and saves some images to disk. Let’s see what he’s doing here. Here’s the bit we’ll look at now.

     # Output some debugging information
     print(("%03d |" + "%.3f "*len(nn.W))%tuple([it]+[W.std() for W in nn.W]))
     W = 1
     for l in range(len(nn.W)):
          W = numpy.dot(W,nn.W[l])
          m = int(W.shape[1]**.5)
          render(W.reshape([28,28,m,m]).transpose([2,0,3,1]).reshape([28*m,28*m]),'W%d.jpg'%(l+1));
     render((nn.X[0]).reshape([mb,28,28]).transpose([1,0,2]).reshape([28,mb*28]),'X.jpg');

The first print command requires a property of the nn object called nn.W. I will jump ahead a bit here and tell you that nn.W is a list of arrays that store the weights in the Boltzmann machine (the numbers defined on the edges connecting nodes). The k^{th} element of this list contains the weight array between the k^{th} and (k+1)^{th} layers. So for a three layer network, there are two elements to this list; nn.W[0] contains the weights between the visible (i.e. zeroth) layer and the first hidden layer, and nn.W[1] contains the weights between the first and second hidden layers.

The command [W.std() for W in nn.W] creates a list of floats. Each entry of this list is the standard deviation across all entries in the array W, for each W in nn.W. In our case, there are two entries in nn.W (nn.W[0] and nn.W[1]), so this would create a two-element list where the first element was the standard deviation of all the elements in nn.W[0] (i.e. the weights connecting the visible and first hidden layer), and the second element was the same but across all the elements of nn.W[1], the weights connecting the first and second hidden layers. The command tuple([it] + [W.std() for W in nn.W]) converts the list created by prepending the iteration counter it to the stds into a tuple, which in this case is a tuple of three elements (it, std(nn.W[0]), std(nn.W[1])). len(nn.W) is just the number of arrays in the list (in this case, two).

Alright so now we can see what this print command is doing. What it outputs is first the iteration number, followed by a separator |, then the standard deviations of the weights: first the weights between the visible and first hidden layers, then the weights between the first and second hidden layers.

Next there is a loop which is performed len(nn.W) times (which recall is the total number of layers minus one). The first time through the loop, l=0, and W is set to nn.W[0], the array containing the weights between the visible and first hidden layer. m is then set to the square root of the number of columns in nn.W[0]. This number will be the square root of the number of hidden units in the first layer (so square root of 400 -> 20). Given the weights and this number, the render() function is called.

The nn.W[0] array has size (784, 400). In order to plot it, Gregoire first resizes by calling .reshape([28,28,20,20]), and then permutes the axes of the reshaped array. The .transpose([2,0,3,1]) call returns the reshaped array with its axes reordered to (2, 0, 3, 1): the old third axis first, then the first, then the fourth, then the second. Finally this new array is reshaped into size [28*20, 28*20], and this is what gets sent to render(). The output image name is 'W1.jpg'. This shows a tiled set of weights, each plotted as two dimensional images.

I was very confused about what this was actually doing. I think I finally managed to work it out. When you call .reshape([28,28,20,20]) it creates a four dimensional array. Think of the first two indices as X, Y coordinates pointing to a pixel at (X, Y). Each of these corresponds to a visible unit. For each there is an array of size 20×20 made from a single row of the original nn.W[0] array. So at this stage we’ve got a convenient way of looking at all of the weights connected to each individual visible unit. For example, if we wanted to visualize all 400 weights connected to say visible unit #345, that would just be the 20×20 array nn.W[0].reshape([28,28,20,20])[12][9], because pixel #345 is in the 12th row and 9th column (12*28+9 = 345).

Now what Gregoire wants to plot is something different — not 784 images each corresponding to the weights connected to a single visible unit, but instead 400 images, each corresponding to the weights coming from a single hidden unit in the first hidden layer and connecting out to all 784 visible units. Let’s call the indices in the visible unit space X, Y and the indices in the hidden unit space x, y. Then nn.W[0].reshape([X,Y,x,y]).transpose([2,0,3,1]) gives an array of size (x,X,y,Y). Calling .reshape([x*X, y*Y]) on this gives an array of size (x*X, y*Y) which is then plotted. I drew up a little sketch of what’s going on for a simple case where there are 9 visibles and 4 hiddens — take a look!

The original array nn.W[0] is on the left. The visualization Gregoire uses takes each column of this array, resizes it so it's 2D, and plots all of these together in a big square array — that's the array on the top right.
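
Here is the toy case from the sketch written out as a runnable check (9 visibles arranged 3×3, 4 hiddens arranged 2×2; the array contents are arbitrary stand-ins):

import numpy

W = numpy.arange(9*4).reshape([9, 4])   # stand-in for nn.W[0]
tiled = W.reshape([3, 3, 2, 2]).transpose([2, 0, 3, 1]).reshape([6, 6])
# The top-left 3x3 tile of the big image is column 0 of W, reshaped to 3x3:
print(numpy.array_equal(tiled[0:3, 0:3], W[:, 0].reshape([3, 3])))   # True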

Here’s what a W1.jpg image looks like.

A W1 visualization. Here you can see a 20×20 array of 28×28 images. Each image shows the weights coming from a single hidden unit in the first hidden layer. You can think of these as the patterns the visible units think about when that particular hidden unit is turned on, and the rest are turned off.

The second time through the loop, the array W changes to the dot product of the first array we just visualized with the second array between the first and second hidden layers. This makes the new array Gregoire is visualizing of size 784×100, and the exact same process happens to this array as before. This is interesting, as now what we’re visualizing are the effects of turning on a single unit in the second hidden layer, turning off all the others in that layer, and propagating its influence through the connections to the first hidden layer units, and then through the connections between the first hidden layer and the visible units.

A W2 visualization, showing a 10×10 array of 28×28 images. Now we are looking at the effects of turning on each of the 100 second layer hidden units one at a time, propagated down to the visible units.

Lastly, there is a render() call on another property of the nn object — nn.X[0]. This property of nn holds a set of mb samples from the visible units of a freely running network, and has size (mb, 784). If things are working properly, these should look like the types of examples we're trying to help the network learn — they are the generated instances from the internal model that's been learned at the stage of the learning process when they are rendered. The reason there are mb of them is that the way the DBM code works, mb different networks are kept throughout the training (these are sometimes called 'fantasy particles'). There is no deep reason to keep exactly mb of them (recall mb is the number of data objects used per training cycle); it just seems to be the thing people usually do. Here's an example from the out of the box code, where there are mb=25 fantasy particles.

Generated values at the 784 (28×28) visible units from a freely running network. If the training is working, these should look like the kind of inputs the network is learning over. These are like the dreams of 25 different networks.

OK folks, hope that was useful. Next time I’m going to go over the heart of the code — the DBM class!

It’s like the quantum computer is playing 20 questions…

I’ve been thinking about the BlackBox compiler recently and came up with a very interesting analogy to the way it works. There are actually lots of different ways to think about how BlackBox works, and we’ll post more of them over time, but here is a very high level and fun one.

The main way that you use BlackBox is to supply it with a classical function which computes the “goodness” of a given bitstring by returning a real number (the lower this number, the better the bitstring was).

Whatever your optimization problem is, you need to write a function that encodes your problem into a series of bits (x1, x2, x3, ..., xN) to be discovered, and which also computes how "good" a given bitstring (e.g. 0,1,1…0) is. When you pass such a function to BlackBox, the quantum compiler then repeatedly comes up with ideas for bitstrings, and using the information that your function supplies about how good its "guesses" are, it quickly converges on the best bitstring possible.
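
As a concrete (and entirely made-up) example, here is the kind of goodness function you might write for number partitioning: split a list of numbers into two groups so that the group sums are as close as possible. The function itself is ordinary classical Python; only the search over bitstrings gets handed to the hardware:

numbers = [4, 7, 1, 12, 3, 9]   # a made-up problem instance

def goodness(bitstring):
     # Lower is better: the absolute difference between the two group sums.
     # Bit i assigns numbers[i] to group 0 or group 1.
     s0 = sum(n for n, b in zip(numbers, bitstring) if b == 0)
     s1 = sum(n for n, b in zip(numbers, bitstring) if b == 1)
     return abs(s0 - s1)

# e.g. goodness([0, 1, 0, 1, 0, 1]) == abs((4+1+3) - (7+12+9)) == 20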

So using this approach the quantum processor behaves as a co-processor to a classical computing resource. The classical computing resource handles one part of the problem (computing the goodness of a given bitstring), and the quantum computer handles the other (suggesting bitstrings). I realized that this is described very nicely as the two computers playing 20 questions with one another.

The quantum computer and the classical computer playing a game of 20 questions.

The quantum computer suggests creative solutions to a problem, and then the classical computer is used to give feedback on how good the suggested solution is. Using this feedback, BlackBox will intelligently suggest a new solution. So in the example above, BlackBox knows NOT to ask "Is it a carrot?" next.

There is actually a deep philosophical point here. One of the pieces that is missing in the puzzle of artificial intelligence is how to make algorithms and programs more creative. I have always been an advocate of using quantum computing to power AI, but we now start to see concrete ways in which it could really start to address some of the elusive problems that crop up when trying to build intelligent machines.

At D-Wave, we have been starting some initial explorations in the areas of machine creativity and machine dreams, but it is early days and the pieces are only just starting to fall into place.

I was wondering if you could use the QC to actually play 20 questions for real. This is quite a fun application idea. If anyone has any suggestions for how to craft 20 questions into an objective function, let me know. My first two thoughts were to do something with Wordnet and NLTK. You could try either a pattern matching or a machine learning version of ‘mining’ wordnet for the right answer. This project would be a little Watson-esque in flavour.

New tutorials on devPortal: WMIS and MCS

There are two new tutorials on the developer portal, complete with code snippets!

The first tutorial describes how to solve Weighted Maximum Independent Set (WMIS) problems using the hardware. Finding the Maximum Independent Set of a bunch of connected variables can be very useful. At a high level, the MIS gives us information about the largest number of 'things' that can be achieved from a set when lots of those 'things' have conflicting requirements. In the tutorial, an example is given of scheduling events for a sports team, but you can imagine all sorts of variants: train timetabling to improve services, assigning patients to surgeons to maximize the throughput of vital operations and minimize waiting lists, adjusting variable speed limits on motorways to reduce traffic jams during periods of congestion, and so on.
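
To get a feel for what such an encoding looks like before reading the tutorial, here is a hedged sketch (my own toy example, not the tutorial's code) of WMIS as a penalty-based objective over bitstrings: reward each selected vertex by its weight, and penalize any selected pair that shares an edge by more than any single vertex could gain:

weights = {0: 3.0, 1: 2.0, 2: 4.0, 3: 1.0}   # hypothetical vertex weights
edges = [(0, 1), (1, 2), (2, 3)]              # hypothetical conflict edges
M = max(weights.values()) + 1.0               # penalty dominating any single weight

def objective(x):
     # x: dict vertex -> 0/1. Lower is better; the minimum selects a
     # maximum-weight independent set, since any conflicting pair costs
     # more than either vertex is worth.
     value = -sum(weights[v]*x[v] for v in weights)
     value += sum(M*x[u]*x[v] for (u, v) in edges)
     return value

# On this toy graph the best choice is vertices {0, 2} (total weight 7):
# objective({0: 1, 1: 0, 2: 1, 3: 0}) == -7.0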

The second tutorial describes how to find Maximum Common Subgraphs given two graphs. The example given in this tutorial is in molecule matching. Looking for areas where sub-structures in molecules are very similar can give us information about how such molecules behave. This is just one simple example of MCS. You can also imagine the same technique being applied to social networks to look for matches between the structuring of social groups. This technique could be used for improving ad placement or even for detecting crime rings.

These two tutorials are closely linked, as finding the MCS involves finding the MIS as part of the process. There are also lots of interesting applications of both these methods in graph and number theory.

If anyone would like to implement WMIS or MCS to solve any of the problem ideas mentioned in this post, please feel free!

The dreams of spiritual machines

When I was in middle school, every year we had to select a project to work on. These projects came from a list of acceptable projects. The projects were typical science-ish projects you’d expect a seventh grader to take on. One year my project was about whooping cranes. Not sure why I picked that one. Maybe I thought it might be related to whooping cough.

One year the subject I picked was dreams. What were they? How did they come about? What, if anything, did they tell us about our waking life? I remember being intensely fascinated by the topic at the time, and feeling that the answers I was getting from grown-ups and from the encyclopedias checked out of the school library (there was no internet back then, at least not in a form I could access) were not satisfactory at all. This was one of my earliest realizations that there were questions no one yet knew the answers to.

The subject of dreams has come up in my adult life several times, and each time the same questions bubble up from that early encounter. An acquaintance of mine went through a period of having night terrors, where she would scream so loud that it would wake people in neighboring houses. She described them as a sense of horror and dread of the most intense and indescribable kind, accompanied by the sure knowledge that it would never end. This led to multiple 911 calls over a period of years. Several trips to specialists and tests revealed nothing out of the ordinary. Then one day they suddenly stopped. To this day no one has a good explanation for why they started, or why they stopped.

One of my friends has multiple vivid, realistic dreams every night, and he remembers them. They are also often terrifying. I on the other hand rarely dream, or if I do, I don’t remember them.

Recently I have been thinking about dreams again, and I have four computer scientists to thank. One of them is Bill Macready, my friend and colleague at D-Wave and the inventor of the framework I'll introduce shortly. The second is Douglas Hofstadter. The third is Geoff Hinton. The fourth is David Gelernter.

Gelernter is a very interesting guy. Not only is he a rock star computer scientist (Bill Joy called him "one of the most brilliant and visionary computer scientists of our time"), he is also an artist, an entrepreneur and a writer with an MA in classical literature. He was badly injured opening a package from the Unabomber in 1993. He is the author of several books, but the one I want to focus on now is The Muse in the Machine, which is must-read material for anyone interested in artificial intelligence.

In this book, Gelernter presents a compelling theory of cognition in which emotion, creativity, analogical thought, dreams and even spirituality are central, critically important ingredients of any machine that thinks, feels and acts as we do. I can't do the book justice in a short post; you should read it.

I am going to pull one quote out of the book, but before I do I want to briefly touch on what Geoff Hinton has to do with all of this. Hinton is also a rock star in the world of artificial intelligence, and in particular in machine learning. He was one of the inventors of backpropagation, and a pioneer in deep belief nets and unsupervised learning. A fascinating demo I really like starts around the 20:00 mark of this video. In the demo, he runs a deep learning system 'in reverse', in generative mode. Hinton refers to this process as the system "fantasizing" about the images it's generating; however, the fantasizing can also be thought of as the system hallucinating, or even dreaming, about the subjects it has learned. Systems such as these exhibit what I believe are clear instances of creativity: generating objects that have never existed in the world before, but that share some underlying property. In Hinton's demo, this property is "two-ness".
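If you want a feel for what running a network 'in reverse' means mechanically, here is a small sketch, assuming an already-trained restricted Boltzmann machine with weight matrix W and bias vectors b and c (all names mine, and a simplification of the deep nets in the demo). Starting from noise and Gibbs-sampling back and forth, the visible layer drifts toward patterns the model believes in.

    # Sketch: "fantasizing" from a trained RBM by Gibbs sampling.
    # Assumes W has shape (n_visible, n_hidden); b and c are the visible
    # and hidden bias vectors of an already-trained model.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def fantasize(W, b, c, steps=1000, seed=0):
        rng = np.random.default_rng(seed)
        v = rng.random(W.shape[0]) < 0.5                     # random visible start
        for _ in range(steps):
            h = rng.random(W.shape[1]) < sigmoid(v @ W + c)  # sample hidden units
            v = rng.random(W.shape[0]) < sigmoid(W @ h + b)  # sample visible units
        return v.astype(int)                                 # a "fantasized" pattern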

Alright, so back to Gelernter, and the quote from The Muse in the Machine:

A computer that never hallucinates cannot possibly aspire to artificial thought.

While Gelernter speaks a somewhat different language than Hinton, I believe that the property he is referring to here (the ability to hallucinate, fantasize or dream) is exactly the sort of thing Hinton's generative digit model exhibits. When you run that model, I would argue, you are seeing the faintest wisps of the beginning of true cognition in a machine.

Douglas Hofstadter is probably the most famous of the four computer scientists I've been thinking about recently. He is of course the author of Gödel, Escher, Bach, which every self-respecting technophile has read, but more importantly he has long argued that we need to think about cognition from a very different perspective than most computer scientists do. For Hofstadter, creativity and analogical reasoning are the key phenomena we need to understand if we are to understand our own cognition. Here he is in "Pattern-finding as the Core of Intelligence", the introduction to his book Fluid Concepts and Creative Analogies:

In 1977, I began my new career as a professor of computer science, aiming to specialize in the field of artificial intelligence. My goals were modest, at least in number: first, to uncover the secrets of creativity, and second, to uncover the secrets of consciousness, by modeling both phenomena on a computer. Good goals. Not easy.

All four of these folks share the view that understanding how analogical thinking and creativity work is an important and under-studied part of building machines like us.

Recently we’ve been working on a series of projects that are aligned with this sort of program. The basic framework is introduced here, in an introductory tutorial.

This basic introduction is extended here.

One of the by-products of this work is a computing system that generates vivid dreamscapes. You can look at one of these by clicking on the candle photograph above, or by following through the Temporal QUFL tutorial, or by clicking on the direct link below.

The technical details of how these dreamscapes are generated are described in those tutorials. I believe these ideas are important. The dreamscapes remind me of H.P. Lovecraft's Dreamlands, and of this passage from Celephaïs:

There are not many persons who know what wonders are opened to them in the stories and visions of their youth; for when as children we learn and dream, we think but half-formed thoughts, and when as men we try to remember, we are dulled and prosaic with the poison of life. But some of us awake in the night with strange phantasms of enchanted hills and gardens, of fountains that sing in the sun, of golden cliffs overhanging murmuring seas, of plains that stretch down to sleeping cities of bronze and stone, and of shadowy companies of heroes that ride caparisoned white horses along the edges of thick forests; and then we know that we have looked back through the ivory gates into that world of wonder which was ours before we were wise and unhappy.

I hope you like them.