# Olympic Gold at the Creative Destruction Lab

The president of the University of Toronto, Dr. Meric Gertler, attended the last G7 meeting, which coincidentally was on the first day of the 2014 Winter Olympics. He presented us with medals. This is as close to Olympic gold as I’m going to get, so the gesture was appreciated.

I knew I should have worn my suit.

Here’s a ‘group shot’ with my G7 friends sporting their new hardware.

# Fusing sensor modalities using an LC-DBM

Our different senses differ in the details of how they work. The features that effectively represent audio signals are different from those that effectively represent vision. More generally, we’d expect that any particular sensor type we provide to a new Creature we’re designing would need its own specific way to effectively represent the types of data it receives from the world, and these representations would differ from sensor to sensor.

The different actuators in our Creature should also get different representations. For example, the motor signals that drive the bulk movement of the Creature (say its wheels) probably have a very different character than those that drive fine motor skills (such as the movement of fingers).

In the DBM framework, there is a natural way to handle this. The approach is called a Locally Connected Deep Boltzmann Machine, or LC-DBM. We briefly encountered this concept in an earlier post, where it made an appearance in this figure (the picture on the right).

Different kinds of Boltzmann Machines. The visible units are grey, the hidden units are white.

Let’s see if we can build an interesting LC-DBM that we can run in hardware.

Embodying Cid in a robot

Imagine we have a robot that has two motors. One of these controls the movement of the back left wheel, and one controls the movement of the back right wheel. The robot will have a castor in front that’s free to rotate, so we get back wheel drive. We’ll assume for this first experiment that these motors only have two settings — off (0) and forward (1). This is not all that restrictive. While there are some things this type of Creature can’t do, he will be able to get to most places using just these movements.

We’ll give each of these motors visible units corresponding to two successive times. Now that we’re starting to think about embodiments, what these times are in actual physical units becomes important. In this case, we’ll set the interval between times to be much longer than the response time of the motors — say 500 ms.

The DBM we’ll start with to represent the two motors at two times will look the same as the one we used in experiment #3. Here it is.

For the motor sector, we use two visible units corresponding to whether a motor is off (0) or moving forward (1) at two successive times. t and t+1 differ by 500 ms.

This Creature is also going to be equipped with a camera, so we can also have vision neurons. A typical camera that you’d mount on a robot provides a huge amount of information, but we’re going to start off by using only a tiny fraction of it, and in a particularly dumb way. We’ll take the images coming in from the camera and separate them into two regions — the left and right halves of the full image. We’ll average all of the pixels in each half and threshold the result: if the average intensity is 128 or higher, that means 1 (i.e. bright = 1), otherwise 0 (dark = 0). This mimics the thresholded photodetector ommatidia idea we discussed a couple of posts back, although now we have two of them — one for the left side of the Creature’s vision, and one for the right side.

Again we’ll have two successive times. Typical cameras provide around 30 frames per second, which is a lot faster than the time we set for the motor response. So we’ll average the camera results over 15 frames, which keeps the interval between times the same as the one we chose for the motors. Again this is not the smartest thing we could do, but we can improve it later! With these choices, here’s the DBM we will use for the vision system.

Note that the unit labels have each been shifted by 4 relative to the motor system.
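The frame-averaging and left/right thresholding described above can be sketched in a few lines. This is a hedged illustration only: the function name, array shapes and test frames are my own, while the 128 threshold and 15-frame window come from the text.

```python
import numpy as np

# Hedged sketch of the vision pipeline: average 15 consecutive grayscale
# frames, split the result into left/right halves, and threshold each half
# at 128 (bright = 1, dark = 0).
def vision_neurons(frames, threshold=128):
    avg = np.mean(frames, axis=0)       # average over the 15 frames
    w = avg.shape[1]
    left = avg[:, :w // 2].mean()       # mean intensity, left half
    right = avg[:, w // 2:].mean()      # mean intensity, right half
    return int(left >= threshold), int(right >= threshold)

# 15 frames in which the left half is dark and the right half is bright
frames = np.zeros((15, 4, 8))
frames[:, :, 4:] = 255
print(vision_neurons(frames))   # (0, 1)
```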

Now let’s equip our Creature with a speaker / microphone. As with the vision system, an audio system we can mount on a robot can provide us with very rich data, but we’ll ignore most of it for the time being. Analogously to the simple system we put in place for vision, let’s again choose two audio neurons, but this time, instead of thresholding the intensity of the visual input on the left/right halves of the incoming images, we’ll threshold the intensity of two different frequencies, one low and one high, corresponding to 100 Hz and 1,000 Hz. An input in each will be 0 if the Fourier component of the signal over a total of 500 ms is less than a threshold, and 1 if it’s greater. The idea is that if these frequencies are present, the corresponding audio neuron will be on; otherwise it will be off.

Here’s the DBM for the audio system.

Here’s the audio DBM. We’ve set it up so that there are two audio neurons, one of which fires if it senses 100 Hz and the other of which fires if it senses 1,000 Hz.
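A minimal sketch of what the two audio neurons could compute, assuming we threshold the magnitude of the FFT bin nearest each target frequency over one observation window (here 0.5 s; the sample rate and threshold value are arbitrary placeholders, not values from a real robot):

```python
import numpy as np

# Hedged sketch: two audio neurons from one window of microphone samples.
# `signal` is a 1-D array sampled at `rate` Hz.
def audio_neurons(signal, rate, freqs=(100.0, 1000.0), threshold=10.0):
    spectrum = np.abs(np.fft.rfft(signal))
    bin_freqs = np.fft.rfftfreq(len(signal), d=1.0 / rate)
    out = []
    for f in freqs:
        idx = np.argmin(np.abs(bin_freqs - f))   # nearest FFT bin
        out.append(int(spectrum[idx] >= threshold))
    return tuple(out)

rate = 8000
t = np.arange(int(0.5 * rate)) / rate       # 0.5 s window
tone = np.sin(2 * np.pi * 100.0 * t)        # pure 100 Hz tone
print(audio_neurons(tone, rate))            # (1, 0)
```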

Finally, let’s add a weapons system. We’ll mount a missile launcher on the robot. Because firing a missile is serious business, we’ll make it so that both weapons neurons have to be on simultaneously for a firing event, so 00, 01 and 10 mean ‘don’t fire’, and 11 means ‘fire’. Again we’ll have two times, separated by 500 ms. Here’s the weapons system DBM.

For the weapons system, we fire only if the most significant bit (MSB) and least significant bit (LSB) are both 1, else we don’t.
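Since firing requires both weapons neurons to be on, the rule is just a logical AND; a tiny sketch (the function name is my own):

```python
# The firing rule: both weapons neurons must be on (state 11) to fire.
def should_fire(msb, lsb):
    return bool(msb and lsb)

for state in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(state, should_fire(*state))   # only (1, 1) prints True
```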

Connecting sensors and actuators at higher levels of the LC-DBM

OK so we have built four different DBMs for audio, vision, motor and weapons. But at the moment they are completely separate. Let’s fix that!

Here is an architecture that brings all four together, by combining the different modalities higher up in the LC-DBM.

An example of how we can connect up all four modalities. The orange level connects audio/weapons, audio/vision and motor/vision; the green level connects vision/audio/weapons and motor/vision/audio; and the top blue level connects all four.

This network can be embedded in hardware. I created this embedding by staring at it for a few minutes. There are probably much better ways to do it. But this one should work. Here it is!

This embedding uses a 4 by 6 block of unit cells, and should allow for all the connections in the network.

So that’s cool! Alright that’s enough for now, next time we’ll think about different experiments we can subject this New and Enhanced Cid to.

# First ever DBM trained using a quantum computer

In Terminator 2, Arnold reveals that his CPU is a neural net processor, a learning computer. Of course it is! What else would it be? Interestingly, there are real neural net processors in the world. D-Wave makes the only superconducting version, but there are other types out there also. Today we’ll use one of our superconducting neural nets to re-run the three experiments we did last time.

I believe this is the first time quantum hardware has been used to train a DBM, although there have been some theoretical investigations.

Embedding into hardware

Recall that the network we were training in the previous post had one visible layer with up to four units, and two hidden layers each with four units. For what follows we’re going to associate each of these units with a specific qubit in a Vesuvius processor. The way we’re going to do this is to use a total of 16 qubits in two unit cells to represent the 12 units in the DBM.

All D-Wave processors can be thought of as hardware neural nets, where the qubits are the neurons and the physical couplers between pairs of qubits are the edges between neurons. Specifically, you should think of them as a type of Deep Boltzmann Machine (DBM): specifying the biases and weights in a DBM is exactly like specifying the biases and coupling strengths in a D-Wave processor. As in a DBM, what you get out are samples from a probability distribution, which are the (binary) states of the DBM’s units (both visible and hidden).

In the Vesuvius design, there is an 8×8 array of eight-qubit unit cells, for a total of 512 ‘neurons’. Each neuron is connected to at most six other neurons in Vesuvius. To do the experiments we want to do, we only need two of the 64 unit cells. For the experts out there, we could use the rest of the chip for some interesting tricks, such as gauge transformations and simple classical parallelism, but for now we’ll just stick to the most basic implementation.

Here is a presentation containing some information about Vesuvius and its design. Take a look at slides 11-17 to get a high level overview of what’s going on.

Here is a picture of the DBM we set up in the last post.

Here we still have two neurons — one vision and one motor — but we have two different times (here labeled t and t+1).

Here is the embedding into hardware we’ll use. Hopefully this is clear! Each of the blue lines is a qubit. The horizontal qubits in unit cell #1 are strongly coupled to the horizontal qubits in unit cell #2 (represented by the red circles). We do this so that the variables in the first hidden layer can talk to all four variables in the second hidden layer (these are the four vertical qubits in unit cell #1) and all four visible units (these are the vertical qubits in unit cell #2).

The embedding into hardware we’ll use here. We use two unit cells from the top left hand corner of the chip. The red circles indicate strong ferromagnetic coupling between the horizontal qubits in the two unit cells, which represent the four variables in the first hidden layer. The leftmost four vertical qubits represent the variables in the second hidden layer, while the rightmost four qubits represent the visible units.

Using the chip to replace the alternating Gibbs sampling step

Recall that the algorithm we used for training the DBM required drawing samples from two different distributions — the ‘freely running’ network, and a network with inputs clamped to values set by the data we are learning over. So now we have a hardware neural net. Can we do these two things directly?

The way the chip works is that we first program in a set of biases and weights, and then draw a bunch of samples from the probability distribution they create. So we should be able to do this by following a very simple prescription — do everything exactly the same as before, except replace the alternating Gibbs sampler with samples drawn from the hardware with its machine language parameters set to the current bias, offset and weight values.

The only tricky part of this (and it’s not really all that tricky) is to create the map between the biases, weights and offsets in the software model to the biases and weights in the hardware.
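For intuition, here's a hedged sketch of that map for the simplest part: converting 0/1-unit biases and weights into ±1-spin (Ising) parameters. The sign conventions and names are my assumptions, and the real machine-language mapping also has to scale the values into the hardware's allowed parameter ranges.

```python
import numpy as np

# Hedged sketch: convert 0/1-unit (Boltzmann machine) parameters into
# +/-1-spin (Ising) parameters via the substitution x = (s + 1) / 2.
# For E(x) = -sum_i b_i x_i - sum_(i,j) w_ij x_i x_j, the equivalent Ising
# energy E(s) = -sum_i h_i s_i - sum_(i,j) J_ij s_i s_j (up to a constant) has
#   J_ij = w_ij / 4,   h_i = b_i / 2 + sum_j (w_ij + w_ji) / 4.
def qubo_to_ising(b, W):
    b, W = np.asarray(b, float), np.asarray(W, float)
    J = W / 4.0
    h = b / 2.0 + (W.sum(axis=1) + W.sum(axis=0)) / 4.0
    return h, J
```

The two energy functions then differ only by a constant, so they define the same Boltzmann distribution.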

Experimental results: Running a real quantum brain

Here are the results of doing this for the three experiments we set up last time, but now comparing training the DBM using alternating Gibbs sampling in software to training the DBM by drawing samples from a Vesuvius 6 chip. The parameters of the run were 100 problems per minibatch, 100 epochs, 1000 learning steps per epoch, learning rate = 0.005 and reparametrization rate = 0 (I set it to zero just to make everything simpler for debugging — we could make it non-zero if we want).

Comparing Alternating Gibbs Sampling in software (blue) to drawing samples from Vesuvius (red). Both do great!

Same comparison, but for Experiment #2. Here we see something very interesting — the quantum version learns faster and gets a lot smarter!

Same but for experiment #3. Again the quantum version learns faster and gets smarter.

This is just so freaking cool.

A recap

So for the first time ever, a quantum computer has been used to train a DBM. We did this for three different experiments, and plotted the $S_0$ number as a function of epoch for 100 epochs. We compared the results of the DBM training on a Vesuvius chip to the same results using the standard alternating Gibbs sampling approach, and found that for experiments 2 and 3 the quantum version trained faster and obtained better scores.

This better performance is due to the replacement of the approximate AGS step with the correct sampling from the full probability distribution obtained from using Vesuvius.

# Three DBM experiments

Well that was a bit tedious. OK maybe more than a little bit. But now we can get back to tormenting Cid. I have in mind three experiments.

In order to modify Gregoire’s DBM code for these experiments, we only need to make some very small changes. Here is what we need to change:

1. The data we’re training over is different. We’ll have to create a new data array X to learn over. What this will look like depends on our choice of External Universe.
2. The size of the network (his brain) is much smaller.
3. The visualizations have to change because of 1. and 2.

Other than that everything is the same!

Experiment #1. Our Original Grumpy Universe

Here’s the setup for this one. Imagine we have a brand new creature we’ve Intelligently Designed. That’s Cid. He has a brain capable of building a model of his Universe. That’s the DBM with 1 visible unit, 4 hidden units in the first hidden layer, and 4 hidden units in the second hidden layer. His visible unit is the interface between his internal model and the External Universe (EU). In this first experiment, we design the EU so that when Cid is observing it, he sees either a 0 or a 1 enter into his visible unit.

Look me in the eye. Compound insect eyes are formed of lots of ommatidia.

You can think of his visible unit as being like a simple thresholded photodetector, which either doesn’t fire (is zero) or fires (is one), with the firing being triggered if the surrounding light is bright enough. In biological creatures, this type of vision unit might be similar to an ommatidium, which is a structure found in the vision systems of insects.

The EU we subject Cid to has the following properties. The ‘light’ entering into Cid’s visible unit changes 30 times per second, and is either 0 (i.e. dark) or 1 (light). The probability of the light being off we’ve arbitrarily set to 13.5%. By design, there are no correlations between subsequent events in this first experiment — there are no patterns in these sequences of light and dark.

The experiment we set up is as follows. We let Cid watch his EU for 1 hour — that’s 30*60*60=108,000 observations. During this time, his internal model is being trained. After this time is up, Cid ‘goes to sleep’ and dreams about his EU: we generate 108,000 samples from the internal model he’s learned up to that point, and record the value of the visible unit for each. If his internal model is an accurate model of the EU, the probability of dreaming of dark generated by the internal model should be about 13.5%. We call one cycle of learning & sleep an epoch. We repeat this sequence for 14 epochs, which simulates the beginning of life for our new creature. For each sleeping period, we track the probability of the internal model dreaming of dark, and compute Cid’s $S_0$ number. As there is only one number that characterizes the EU, if the model can learn this number it has done its job, and this should be reflected in a large value for $S_0$.
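The 'awake' half of one epoch is easy to sketch (the variable names and RNG seed are mine; the 13.5% and 108,000 figures are from the text):

```python
import numpy as np

# One hour of 'awake' observations at 30 per second: 108,000 bits, with the
# light off (0) 13.5% of the time and on (1) otherwise.
rng = np.random.default_rng(2)
observations = (rng.random(30 * 60 * 60) >= 0.135).astype(int)
p_dark_seen = 1.0 - observations.mean()
print(observations.size)   # 108000
```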

Experiment #1: Results

The network performs well at all epochs in this case.

Here is a plot of the $S_0$ number for Cid over a period of 14 epochs. You can see that it’s very large and that the value jumps around quite a bit. By looking at the actual probabilities his internal model generates, anything with an $S_0$ number greater than about 100,000 is equivalent within statistical noise. So Cid’s brain is able to learn an excellent model of this EU, even after only one epoch of training. Doing more training changes his brain configuration but doesn’t help much; all we see after that is the effect of statistical fluctuations in the input data he’s seeing.

Here are the actual values of Cid’s network parameters after 14 epochs have concluded. This DBM has learned an excellent model for the EU in this experiment.

```
weights = [array([[ 0.01813747,  0.00530222,  0.00424123,  0.00458312]], dtype=float32),
           array([[-0.00190055,  0.01029627,  0.00203666,  0.00552391],
                  [ 0.00306078,  0.01077824,  0.00614936,  0.00418225],
                  [ 0.00407983,  0.01392216,  0.02022623, -0.00406721],
                  [ 0.00017785,  0.00020632,  0.009114  , -0.00081153]], dtype=float32)]
biases = [array([ 1.86096434]),
          array([-0.98480152, -1.03069905, -1.01625627, -0.97723272]),
          array([-1.01399916, -1.00536941, -1.01015504, -0.98526637])]
offsets = [array([ 0.86534613]),
           array([ 0.2712875 ,  0.26286839,  0.26594376,  0.27397069]),
           array([ 0.26666885,  0.26782288,  0.26658711,  0.27174567])]
```

Experiment #2. Adding motor output

Experiment #1 was great for getting a good understanding of how to train a DBM, and to start thinking about what it all means. But we want more! Here’s a very slight extension of Experiment #1, where we add a single new visible unit.

This new visible unit will represent a different type of thing. It will represent the direction of motion of Cid. Specifically, the new visible unit will have value 0 if Cid is moving to the left, and 1 if Cid is moving to the right.

With this new visible unit, Cid now has two visible units (one representing a ‘vision neuron’, and one representing a ‘motor neuron’). We’ll keep the hidden layers the same as before.

The network setup for experiment #2. It’s the same as before, except now there are two visible units. One is a vision neuron and the other is a motor neuron.

We’ll repeat the same basic experimental setup as the first one, where Cid first looks around at his EU for a while, and then dreams about it for a while, and repeats this for a few epochs. However because we’ve introduced a new type of visible unit we have some interesting issues arise.

Thinking of the motor neuron as a sensor AKA Avatar Mode

The first involves the training data. Now a piece of training data includes two bits — one is the visual input, and the other is a motor input. Understanding what a visual input is is pretty easy, but what’s a motor input? We usually think of motor signals as outputs — something we do, rather than something that’s done to us. But here we’re going to imagine that during the time when Cid is awake, he’s actually being moved around by an external force. Suzanne calls this Avatar Mode — imagine you are controlling Cid’s motor output (say, by remote control), and the direction of his motion becomes the training data for the motor input.

For experiment #2, let’s assume that when Cid is learning and being ‘shown what to do’, he tends to be moved left when the light is off, and moved right when the light is on. What this means is that input data objects will tend to favor the bit states 00 (light is off, moving left) and 11 (light is on, moving right) over 01 (light is off, moving right) and 10 (light is on, moving left). If we again assume that there is no correlation in time between the states coming into the visible units, then this EU is fully characterized by three numbers — the probability of the vision neuron observing 0 and the motor neuron moving left; the probability of the vision neuron observing 0 and the motor neuron moving right; and the probability of the vision neuron observing 1 and the motor neuron moving right. The fourth probability (vision neuron observing 1, motor neuron moving left) is fixed because the four probabilities must sum to one.

To track the learning we’ll again plot the $S_0$ number as a function of epoch. Recall that the $S_0$ number is the inverse of the KL-divergence, which compares the ‘true’ probability distribution found in the EU with the probability distribution generated by Cid’s freely running brain.

In our experiment, we’ll (again, arbitrarily) set $P_{00}=0.431, P_{11}=0.292, P_{01}=0.145, P_{10}=0.132$. These numbers fully characterize this EU.
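For concreteness, here's a hedged sketch of the $S_0$ computation for this EU. The dreamed distribution q below is made up for illustration; only p comes from the text.

```python
import numpy as np

# S_0 = 1 / KL(p || q): p is the EU's true distribution over the four
# visible states, q is the distribution Cid dreams. Assumes all entries
# are strictly positive and each distribution sums to one.
def s0_number(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 1.0 / np.sum(p * np.log(p / q))

p = [0.431, 0.292, 0.145, 0.132]   # P_00, P_11, P_01, P_10 from the text
q = [0.430, 0.290, 0.147, 0.133]   # hypothetical dreamed probabilities
print(s0_number(p, q))
```

A nearly perfect model gives a tiny KL divergence and hence a huge $S_0$; a poor model (say the uniform distribution) scores far lower.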

Sampling with some subset of the visible units clamped

Now that we have visible units representing different types of things (one vision, one movement), some new possibilities arise for investigating Cid’s behavior once his internal model has been trained. We now have the option of clamping some of the visible units and asking what values the remaining visible units take, by drawing samples from the internal model with those units clamped.

There are three different modes we can look at in experiment #2. The first is when we don’t clamp either visible unit, and we let the entire network run freely. Now the interpretation of the state of the motor neuron, when the network is run freely, is that the motor neuron is now an output — Cid is moving autonomously while his vision system is dreaming of dark and light.

The second mode is when we clamp the vision neuron to either light or dark and draw samples from the network to determine the state of the motor neuron. This is like Cid being awake, and we set the lights to whatever we like, and Cid moves autonomously based on what he’s learned about the correlations of light and movement during the training phase. This type of behavior is probably pretty similar to what we’ll want to do once we have an embodied version of Cid, and we want him to move around autonomously. Note that functionally this is the sort of behavior a lot of insects display — either attraction to or aversion to light.

The third mode is when we clamp the motor neuron and draw samples from the network to determine what Cid’s internal model ‘sees’ on his vision neuron based on the movement we’re forcing on him.

Experiment #2: Results

Here it doesn’t start out so well, but by the end the model is pretty good!

Here we see that the $S_0$ number doesn’t start very high, like it did in experiment #1. In fact after the first 5 epochs or so, it looks like Cid wasn’t getting a lot smarter. But all of a sudden around epoch #7 it looks like he finally started to ‘get it’! His internal model started effectively modeling what he was seeing. By the end of his training his understanding of this EU was comparable to his understanding of the simpler EU in experiment #1.

We can interpret this success as Cid’s having learned about the correlations between his visual input and his motor input — he was able to reach a full understanding of all there is to know about the EU in experiment #2.

Here are the network parameters obtained after epoch #14.

```
weights = [array([[-0.33172947,  2.06676388,  0.90290803],
                  [ 1.94313753, -0.34757519,  2.05962396,  0.91246027]], dtype=float32),
           array([[-0.01521079,  0.007444  , -0.01788999,  0.01016755],
                  [ 0.01918931,  0.00635643,  0.02085816, -0.01936374],
                  [-0.00448515, -0.01009895, -0.00765508,  0.01153829],
                  [-0.00094647, -0.00817871, -0.01026069, -0.0089241 ]], dtype=float32)]
biases = [array([-0.39935381, -0.29827521]),
          array([-0.72053314, -0.93864307, -0.67654053, -0.93797132]),
          array([-0.98044517, -1.01085684, -1.006881  , -0.98185932])]
offsets = [array([ 0.42313661,  0.43706216]),
           array([ 0.38207002,  0.28493565,  0.39176196,  0.30359983]),
           array([ 0.27248023,  0.26672524,  0.26748523,  0.27268372])]
```

Experiment #3. Explicitly adding time

In both of the previous experiments, the EU was chosen so that there were no correlations between subsequent inputs — there were no time-dependent patterns in the input bit strings Cid was being shown. Of course, in the EU humans are exposed to, there are such patterns. Let’s see if we can modify our DBM to be able to learn time-dependent patterns.

The way we’ll do this is pretty simple. Instead of having a single vision neuron and a single motor neuron, we’ll have two of each, where one pair represents the current observation and the other pair represents the immediately previous observation.

Here we still have two neurons — one vision and one motor — but we have two different times (here labeled t and t+1), and therefore the network has four visible units. We’ve grouped the vision and motor neurons together for a reason we’ll get to later!

An EU with time-dependent correlations

To test this architecture, let’s create an EU where there now are correlations in time between subsequent inputs. For a general 4 bit input, there are $2^4-1$ independent probabilities. Let’s set the probabilities of this EU to be $P_{0000}=0.30,P_{0001}=0.004,P_{0010}=0.008,P_{0011}=0.012,P_{0100}=0.016,P_{0101}=0.020,P_{0110}=0.024,P_{0111}=0.028,P_{1000}=0.032,P_{1001}=0.036,P_{1010}=0.040,P_{1011}=0.044,P_{1100}=0.048,P_{1101}=0.052,P_{1110}=0.056,P_{1111}=0.28$. These numbers are all pretty much arbitrary, although I made them different just to make sure I could tell if the model was capturing those differences, and I made the two states $0000$ and $1111$ most likely — these represent (a)  the situation where the vision neuron is reading dark and the motor is going left at time t, and the same thing happens at t+1, and (b) the situation where the vision neuron is reading light and the motor is going right at time t, and the same thing happens at t+1. So this EU favors dark/left and light/right and things staying the way they are in time. Again we’ll plot the $S_0$ number as a metric for how our learning is progressing.
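Generating training data from this EU is just sampling from a sixteen-outcome distribution and unpacking each outcome into its 4 bits. A sketch (the RNG seed and variable names are mine):

```python
import numpy as np

# The sixteen probabilities P_0000 .. P_1111 from the text, in order.
probs = [0.30, 0.004, 0.008, 0.012, 0.016, 0.020, 0.024, 0.028,
         0.032, 0.036, 0.040, 0.044, 0.048, 0.052, 0.056, 0.28]

rng = np.random.default_rng(0)
idx = rng.choice(16, size=108000, p=probs)        # one epoch of observations
X = (idx[:, None] >> np.arange(3, -1, -1)) & 1    # unpack each draw to 4 bits
print(X.shape)   # (108000, 4)
```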

A new type of thing we get from clamping some of the visible units — prediction

Imagine Cid takes in an observation of the current state of both his vision and motor neurons. Now if we draw a sample from his network with these clamped to the observed values, we obtain states for both at the next time step. These states are predictions about what Cid thinks should happen next, conditioned on his current observations. Isn’t that cool? His predictions are based on what he’s learned about the time dependence of his EU during the training phase.

Another interesting thing we can do is clamp only the vision neuron to the currently observed light and draw samples from the network. Cid will then move autonomously based on the resulting state of his motor neuron, and will also generate a prediction about both what he should see and where he should move next.
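Here's a hedged sketch of prediction-by-clamping on a network with this shape. The weights below are random placeholders rather than Cid's trained values, and the update scheme is a simple blocked Gibbs sweep, a simplification of the actual sampling code.

```python
import numpy as np

# Hedged sketch of prediction by clamping. Layer sizes (4 visible units, two
# hidden layers of 4) follow the text; everything else is illustrative.
rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_next(v_now, W1, W2, b, n_steps=200):
    # Visible layout: (vision_t, motor_t, vision_t+1, motor_t+1).
    # The two time-t units stay clamped to the current observation.
    v = np.array([v_now[0], v_now[1], 0, 0], dtype=float)
    h1 = rng.integers(0, 2, 4).astype(float)
    h2 = rng.integers(0, 2, 4).astype(float)
    for _ in range(n_steps):
        h1 = (rng.random(4) < sigmoid(v @ W1 + W2 @ h2 + b[1])).astype(float)
        h2 = (rng.random(4) < sigmoid(h1 @ W2 + b[2])).astype(float)
        # resample only the unclamped t+1 visible units
        v[2:] = (rng.random(2) < sigmoid(W1[2:] @ h1 + b[0][2:])).astype(float)
    return v[2], v[3]   # predicted vision and motor states at t+1

W1 = rng.normal(0, 0.1, (4, 4))   # visible -> first hidden layer
W2 = rng.normal(0, 0.1, (4, 4))   # first -> second hidden layer
b = [np.zeros(4), np.zeros(4), np.zeros(4)]
print(predict_next((1, 1), W1, W2, b))
```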

Experiment #3: Results

Here the learning looks very linear.

As in the previous two experiments, we see that the $S_0$ number is increasing over the training period. Interestingly the rate of increase here looks linear, whereas in the previous two that was not the case. This could have something to do with the learning rate hyperparameter in the algorithm. It could be that the learning rate is too large for the first two and about right for this one.

Here are the final network parameters for Cid’s brain after epoch #14 for experiment #3.

```
weights = [array([[ 0.68729705,  3.25865531,  1.52063966,  0.06205666],
                  [ 1.29153585,  2.89045095,  1.39506388,  0.0642081 ],
                  [ 1.45107603,  2.90390301,  1.613253  ,  0.0708109 ],
                  [ 1.65337193,  2.9249208 ,  1.90962994,  0.07118615]], dtype=float32),
           array([[-0.01320299,  0.0017458 ,  0.00971262, -0.01334796],
                  [ 0.00066732,  0.01005933,  0.02247284, -0.00862475],
                  [-0.00398598,  0.0030924 ,  0.02086396, -0.01616782],
                  [-0.00087102, -0.00204386,  0.01797051,  0.00104328]], dtype=float32)]
biases = [array([ 0.29037783, -0.17516954, -0.47229764, -0.6564995 ]),
          array([-0.91211764,  3.56308915,  1.31385738, -1.03494297]),
          array([-1.01374428, -0.99906059, -0.95210618, -0.9959591 ])]
offsets = [array([ 0.5877514 ,  0.52328677,  0.49066822,  0.476114  ]),
           array([ 0.39119173,  0.68874638,  0.63992906,  0.26271362]),
           array([ 0.26620444,  0.2689028 ,  0.27874745,  0.27008831])]
```

Alright what have WE learned?

Cid has learned some things over the past few days. What have we learned?

Well it’s pretty clear that for the types of EU we’ve designed for Cid, even a very small DBM brain seems capable of reaching Enlightenment. This is kind of neat, especially when you consider that even for the more or less trivial cases we’ve been looking at, you can see how both sensor and actuator signals from a real embodied creature can be handled by the same framework. There is a clear way to enable autonomous behavior, where the machine entity makes its own decisions about what to do based on what it’s learned in the past. In addition, there is also a mechanism for ‘modeling the future’ which many folks believe (rightly, I think) is a key idea for understanding cognition.

Alright so next time we’ll take a look at how we might do the same types of learning, but using a Vesuvius processor… mmm quantum brains.