# Fusing sensor modalities using an LC-DBM

Our different senses differ in detail. The features that allow effective representation of audio signals are different than those for representing vision. More generally we’d expect that any particular sensor type we’d provide to a new Creature we’re designing would need some new and specific way to effectively represent the types of data it receives from the world, and these representations would be different from sensor to sensor.

The different actuators in our Creature should also get different representations. For example, the motor signals that drive the bulk movement of the Creature (say its wheels) probably have a very different character than those that drive fine motor skills (such as the movement of fingers).

In the DBM framework, there is a natural way to handle this. The approach is called a Locally Connected Deep Boltzmann Machine, or LC-DBM. We briefly encountered this concept in an earlier post, where it made an appearance in this figure (the picture on the right).

Different kinds of Boltzmann Machines. The visible units are grey, the hidden units are white.

Let’s see if we can build an interesting LC-DBM that we can run in hardware.

Embodying Cid in a robot

Imagine we have a robot that has two motors. One of these controls the movement of the back left wheel, and one controls the movement of the back right wheel. The robot will have a castor in front that’s free to rotate, so we get back wheel drive. We’ll assume for this first experiment that these motors only have two settings — off (0) and forward (1). This is not all that restrictive. While there are some things this type of Creature can’t do, he will be able to get to most places using just these movements.

We’ll give each of these motors visible units corresponding to two successive times. Now that we’re starting to think about embodiments, what these times are in actual physical units becomes important. In this case, we’ll set the interval between times to be much longer than the response time of the motors — say 450 ms.

The DBM we’ll start with to represent the two motors at two times will look the same as the one we used in experiment #3. Here it is.

For the motor sector, we use two visible units corresponding to whether a motor is off (0) or moving forward (1) at two successive times. t and t+1 differ by 450 ms.

This Creature is also going to be equipped with a camera, so we can also have vision neurons. A typical camera that you’d mount on a robot provides a huge amount of information, but what we’re going to do is to start off by only using a tiny fraction of it, and in a particularly dumb way. What we’ll do is take the images coming in from the camera, and separate them into two regions — the left and right halves of the full image. We’ll take all of the pixels in each side, average them, and threshold them such that if the average intensity of the pixels is 128 or higher, that means 1 (i.e. bright = 1) otherwise 0 (dark = 0). This mimics the thresholded photodetector ommatidia idea we discussed a couple of posts back, although now we have two of them — one for the left side of the creature’s vision, and one for the right side.

Again we’ll have two successive times. Typical cameras provide around 30 frames per second, which is a lot faster than the time we set for the motor response. So what we’ll do is average the camera results over 15 frames, so that we can keep the difference in time the same as the difference we chose for the motors. Again this is not the smartest thing we could do but we can improve this later! With these choices, here’s the DBM we will use for the vision system.

Note that the unit labels have been shifted by 4 for each from the motor system.

Now let’s equip our Creature with a speaker / microphone. As with the vision system, an audio system we can mount on a robot can provide us with very rich data. But we’ll ignore most of it for the time being. Analogously to the simple system we put in place for vision, let’s again choose two audio neurons, but this time instead of thresholding the intensity of the visual input on the left/right halves of the incoming images, we’ll threshold the intensity of two different frequencies, one low and one high, corresponding to 100 Hz and 1000 Hz. An input in each will be 0 if the fourier component of the signal over a total of 450ms was less than a threshold, and 1 if it’s greater. The idea is that if these frequencies are present, the corresponding audio neuron will be on, otherwise it will be off.

Here’s the DBM for the audio system.

Here’s the audio DBM. We’ve set it up so that there are two audio neurons, one of which fires if it senses 100Hz and the other which fires if it senses 1,000Hz.

Finally, let’s add a weapons system. We’ll mount a missile launcher on the robot. Because firing a missile is serious business, we’ll make it so that both weapons neurons have to be on simultaneously for a firing event, so 00, 01 and 10 mean ‘don’t fire’, and 11 means ‘fire’. Again we’ll have two times, separated by 450 ms. Here’s the weapons system DBM.

For the weapons system, we fire only if the most significant bit (MSB) and least significant bit (LSB) are both 1, else we don’t.

Connecting sensors and actuators at higher levels of the LC-DBM

OK so we have built four different DBMs for audio, vision, motor and weapons. But at the moment they are completely separate. Let’s fix that!

Here is an architecture that brings all four together, by combining the different modalities higher up in the LC-DBM.

An example of how we can connect up all four modalities. The orange level connects audio/weapons, audio/vision and motor/vision; the green level connects vision/audio/weapons and motor/vision/audio; and the top blue level connects all four.

This network can be embedded in hardware. I created this embedding by staring at it for a few minutes. There are probably much better ways to do it. But this one should work. Here it is!

This embedding uses a 4 by 6 block of unit cells, and should allow for all the connections in the network.

So that’s cool! Alright that’s enough for now, next time we’ll think about different experiments we can subject this New and Enhanced Cid to.

# First ever DBM trained using a quantum computer

In Terminator 2, Arnold reveals that his CPU is a neural net processor, a learning computer. Of course it is! What else would it be? Interestingly, there are real neural net processors in the world. D-Wave makes the only superconducting version, but there are other types out there also. Today we’ll use one of our superconducting neural nets to re-run the three experiments we did last time.

I believe this is the first time quantum hardware has been used to train a DBM, although there have been some theoretical investigations.

Embedding into hardware

Recall that the network we were training in the previous post had one visible layer with up to four units, and two hidden layers each with four units. For what follows we’re going to associate each of these units with a specific qubit in a Vesuvius processor. The way we’re going to do this is to use a total of 16 qubits in two unit cells to represent the 12 units in the DBM.

All D-Wave processors can be thought of as hardware neural nets, where the qubits are the neurons and the physical couplers between pairs of qubits are edges between qubits. Specifically you should think of them as a type of Deep Boltzmann Machine (DBM), where specifying the biases and weights in a DBM is exactly like specifying the biases and coupling strengths in a D-Wave processor. As in a DBM, what you get out are samples from a probability distribution, which are the (binary) states of the DBM’s units (both visible and hidden).

In the Vesuvius design, there is an 8×8 tile of eight-qubit unit cells, for a total of 512 ‘neurons’. Each neuron is connected to at most 6 other neurons in Vesuvius. To do the experiments we want to do, we only need two of the 64 unit cells. For the experts out there, we could use the rest to do some interesting tricks to use more of the chip, such as gauge transformations and simple classical parallelism, but for now we’ll just stick to the most basic implementation.

Here is a presentation containing some information about Vesuvius and its design. Take a look at slides 11-17 to get a high level overview of what’s going on.

Here is a picture of the DBM we set up in the last post.

Here we still have two neurons — one vision and one motor — but we have two different times (here labeled t and t+1).

Here is the embedding into hardware we’ll use. Hopefully this is clear! Each of the blue lines is a qubit. The horizontal qubits in unit cell #1 are strongly coupled to the horizontal qubits in unit cell #2 (represented by the red circles). We do this so that the variables in the first hidden layer can talk to all four variables in the second hidden layer (these are the four vertical qubits in unit cell #1) and all four visible units (these are the vertical qubits in unit cell #2).

The embedding into hardware we’ll use here. We use two units cells from the top left hand corner of the chip. The red circles indicate strong ferromagnetic coupling between the horizontal qubits in the two unit cells, which represent the four variables in the first hidden layer. The leftmost four vertical qubits represent the variables in the second hidden layer, while the rightmost four qubits represent the visible units.

Using the chip to replace the alternating Gibbs sampling step

Recall that the algorithm we used for training the DBM required drawing samples from two different distributions — the ‘freely running’ network, and a network with inputs clamped to values set by the data we are learning over. So now we have a hardware neural net. Can we do these two things directly?

The way the chip works is that we first program in a set of biases and weights, and then draw a bunch of samples from the probability distribution they create. So we should be able to do this by following a very simple prescription — do everything exactly the same as before, except replace the alternating Gibbs sampler with samples drawn from the hardware with its machine language parameters set to the current bias, offset and weight values.

The only tricky part of this (and it’s not really all that tricky) is to create the map between the biases, weights and offsets in the software model to the biases and weights in the hardware.

Experimental results: Running a real quantum brain

Here are the results of doing this for the three experiments we set up last time, but now comparing training the DBM using alternating Gibbs sampling in software to training the DBM by drawing samples from a Vesuvius 6 chip. The parameters of the run were 100 problems per minibatch, 100 epochs, 1000 learning steps per epoch, learning rate = 0.005 and reparametrization rate = 0 (I set it to zero just to make everything simpler for debugging — we could make it non-zero if we want).

Comparing Alternating Gibbs Sampling in software (blue) to drawing samples from Vesuvius (red). Both do great!

Same comparison, but for Experiment #2. Here we see something very interesting — the quantum version learns faster and gets a lot smarter!

Same but for experiment #3. Again the quantum version learns faster and gets smarter.

This is just so freaking cool.

A recap

So for the first time ever, a quantum computer has been used to train a DBM. We did this for three different experiments, and plotted the $S_0$ number as a function of epoch for 100 epochs. We compared the results of the DBM training on a Vesuvius chip to the same results using the standard alternating Gibbs sampling approach, and found that for experiments 2 and 3 the quantum version trained faster and obtained better scores.

This better performance is due to the replacement of the approximate AGS step with the correct sampling from the full probability distribution obtained from using Vesuvius.

# Sparse coding on D-Wave hardware: finding an optimal structured dictionary

I spend most of my time thinking about machine intelligence. I would like to build machines that can think and act like we do. There are many hard problems to solve before we get there.

C. elegans is vermiform, with a cuticle integument and a fluid-filled pseudocoelomate cavity. There are two sexes: male and hermaphrodite.

A thing I’ve been thinking about recently is how the limits of cognition matter in understanding cognition itself. Clearly you can have examples where the machinery is not sophisticated enough. For example a C. elegans roundworm is not going to reverse engineer its own neural system any time soon, even though it is very simple (although the descendents of openworm might…).

Openworm is a really interesting idea. I hope it does well and succeeds. Although the idea of the first life forms on the internet being worms, that OBVIOUSLY will grow super intelligent and take over the universe and consume all its atoms making bigger and bigger Harlem Shake videos, is a little off-putting from a human perspective. De Vermis Mysteriis

We are so smart, s-m-r-t

Hoomans are smrt.

Imagine the most intelligent entity possible. Would that thing be able to understand its own cognition? As you crank up a cognitive system’s ability to model its environment, presumably the cognition system itself gets more difficult to understand.

Is the human cognitive system both smart enough and simple enough to self reverse engineer? It’s probably in the right zone. We seem to be smart enough to understand enough of the issues to take a good run at the problem, because our cognition system is simple enough to not be beyond the ability of our cognition system itself to understand. How’s that for some Hofstadter style recursion.

Anyway enough with the Deep Thoughts. Let’s do some math! Math is fun. Not as fun as universe eating worms. But solving this problem well is important. At least to me and my unborn future vermiform army. Maybe you can help solve it.

A short review of L0-norm sparse coding with structured dictionaries

Last time we discussed sparse coding on the hardware, I introduced an idea for getting around a problem in using D-Wave style processor architectures effectively – the mismatch between the connectivity of the problem we want to solve and the connectivity of the hardware.

Let’s begin by first reviewing the idea. If you’d like a more in-depth overview, here is the original post I wrote about it. Here is the condensed version.

Given

1. A set of $S$ data objects $\vec{z}_s$, where each $\vec{z}_s$ is a real valued vector with $N$ components;
2. An $N x K$ real valued matrix $\hat{D}$, where $K$ is the number of dictionary atoms we choose, and we define its $k^{th}$ column to be the vector $\vec{d}_k$;
3. A $K x S$ binary valued matrix $\hat{W}$, whose matrix elements are $w_{ks}$;
4. And a real number $\lambda$, which is called the regularization parameter,

Find $\hat{W}$ and $\hat{D}$ that minimize

$G(\hat{W}, \hat{D} ; \lambda) = \sum_{s=1}^S || \vec{z}_{s} - \sum_{k=1}^{K} w_{ks} \vec{d}_k ||^2 + \lambda \sum_{s=1}^S \sum_{k=1}^{K} w_{ks}$

subject to the constraints that $\vec{d}_j \cdot \vec{d}_m = 0$ for all pairs $j,m$ that are not connected in the quantum chip being used.

To solve this problem, we use block coordinate descent, which works like this:

1. First, we generate a random dictionary $\hat{D}$, subject to meeting the orthogonality constraints we’ve imposed on the dictionary atoms.
2. Assuming these fixed dictionaries, we solve the optimization problem for the dictionary atoms $\hat{W}$. These optimization problems are now Chimera-structured QUBOs that fit exactly onto the hardware by construction.
3. Now we fix the weights to these values, and find the optimal dictionary $\hat{D}$, again subject to our constraints.

We then iterate steps 2 and 3 until $G$ converges to a minimum, keeping in mind that this problem is jointly non-convex and the minimum you get will be a local minimum. Each restart of the whole algorithm from a new standing point will lead to a different local minimum, so a better answer can be had by running this procedure several times.

Step 3: Finding an optimal structured dictionary given fixed weights

The hard problem is Step 3 above. Here the weights $\hat{W}$ are fixed, and we want to find an optimal structured dictionary. Here is the formal statement of the problem.

Given

1. An $N x S$ real valued matrix $\hat{Z}$, where $S$ is the number of data objects, and we define the $s^{th}$ column to be the $s^{th}$ data object $\vec{z}_s$, where each $\vec{z}_s$ is a real valued vector with $N$ components, and the matrix elements of $\hat{Z}$ are $z_{ns}$;
2. An $N x K$ real valued matrix $\hat{D}$, where $K$ is the number of dictionary atoms we choose, and we define its $k^{th}$ column to be the vector $\vec{d}_k$, and the matrix elements of $\hat{D}$ are $d_{nk}$;
3. And a $K x S$ binary valued matrix $\hat{W}$ with matrix elements $w_{ks}$;

Find $\hat{D}$ that minimizes

$G^{*}(\hat{D}) = \sum_{s=1}^S || \vec{z}_{s} - \sum_{k=1}^{K} w_{ks} \vec{d}_k ||^2$

${=\sum_{s=1}^S \sum_{n=1}^N (z_{ns}-\sum_{k=1}^{K} w_{ks} d_{nk})^2}$

${=||\hat{Z}-\hat{D} \hat{W}||^2=Tr(\hat{A}^T \hat{A})}$

where $\hat{A}=\hat{Z}-\hat{D} \hat{W}$, subject to the constraints that $\vec{d}_j \cdot \vec{d}_m = 0$ for all pairs $j,m$ that are not connected in the quantum chip being used.

What makes this problem hard is that the constraints on the dictionary atoms are non-linear, and there are a lot of them (one for each pair of variables not connected in hardware).

Ideas for attacking this problem

I’m not sure what the best approach is for trying to solve this problem. Here are some observations:

• We want to be operating in the regime where $\hat{W}$ is sparse. In this limit most of the $w_{ks}$ will be zero. Because the coupling term is quadratic in $\hat{W}$‘s matrix elements, for all L0-norm sparse coding problems most of the coupling terms are going to be zero. This suggests a possible strategy where we could first solve for $\hat{D}$ assuming that the quadratic term was zero, and then whatever we do next could use this as an initial starting point.
• There are some types of matrix operations that would not mess up the structure of the dictionary but would allow parametrization of changes within the allowed space. If we could then optimize over those parameters we could take care of the constraints without having to do any work to enforce them.
• There is a local search heuristic where you can optimize each dictionary atom $\vec{d}_k$ moving from $k=1$ to $k=K$ in order while keeping the other columns fixed, and just iterating until convergence (you have to do some rearranging to ensure the orthogonality is maintained throughout using the null space idea in the previous post). This will probably by itself not be a great strategy and will probably get you stuck in local optima but maybe it would work OK.

What do you think? I might be able to get you time on a machine if you can come up with an interesting way to solve this problem effectively… :-)

# Sparse coding on D-Wave hardware: structured dictionaries

The underlying problem we saw last time, that prevented us from using the hardware to compete with tabu on the cloud, was the mismatch of the connectivity of the problems sparse coding generates (which are fully connected) and the connectivity of the hardware.

The source of this mismatch is the quadratic term in the objective function, which for the $j^{th}$ and $m^{th}$ variables is proportional to $\vec{d}_j \cdot \vec{d}_m$. The coupling terms are proportional to the dot product of the dictionary atoms.

Here’s an idea. What if we demand that $\vec{d}_j \cdot \vec{d}_m$ has to be zero for all pairs of variables $j$ and $m$ that are not connected in hardware? If we can achieve this structure in the dictionary, we get a very interesting result. Instead of being fully connected, the QUBOs with this restriction can be engineered to exactly match the underlying problem the hardware solves. If we can do this, we get closer to using the full power of the hardware.

L0-norm sparse coding with structured dictionaries

Here is the idea.

Given

1. A set of $S$ data objects $\vec{z}_s$, where each $\vec{z}_s$ is a real valued vector with $N$ components;
2. An $N x K$ real valued matrix $\hat{D}$, where $K$ is the number of dictionary atoms we choose, and we define its $k^{th}$ column to be the vector $\vec{d}_k$;
3. A $K x S$ binary valued matrix $\hat{W}$;
4. And a real number $\lambda$, which is called the regularization parameter,

Find $\hat{W}$ and $\hat{D}$ that minimize

$G(\hat{W}, \hat{D} ; \lambda) = \sum_{s=1}^S || \vec{z}_{s} - \sum_{k=1}^{K} w_{ks} \vec{d}_k ||^2 + \lambda \sum_{s=1}^S \sum_{k=1}^{K} w_{ks}$

subject to the constraints that $\vec{d}_j \cdot \vec{d}_m = 0$ for all pairs $j,m$ that are not connected in the quantum chip being used.

The only difference here from what we did before is the last sentence, where we add a set of constraints on the dictionary atoms.

Solving the sparse coding problem using block coordinate descent

We’re going to use the same strategy for solving this as before, with a slight change. Here is the strategy we’ll use.

1. First, we generate a random dictionary $\hat{D}$, subject to meeting the orthogonality constraints we’ve imposed on the dictionary atoms.
2. Assuming these fixed dictionaries, we solve the optimization problem for the dictionary atoms $\hat{W}$. These optimization problems are now Chimera-structured QUBOs that fit exactly onto the hardware by construction.
3. Now we fix the weights to these values, and find the optimal dictionary $\hat{D}$, again subject to our constraints.

We then iterate steps 2 and 3 until $G$ converges to a minimum.

Now we’re in a different regime than before — step 2 requires the solution of a large number of chimera-structured QUBOs, not fully connected QUBOs. So that makes those problems better fits to the hardware. But now we have to do some new things to allow for both steps 1 and 3, and these initial steps have some cost.

The first of these is not too hard, and introduces a key concept we’ll use for Step 3 (which is harder). In this post I’ll go over how to do Step 1.

Step 1: Setting up an initial random dictionary that obeys our constraints

Alright so the first step we need to do is to figure out under what conditions we can achieve Step 1.

There is a very interesting result in a paper called Orthogonal Representations and Connectivity of Graphs. Here is a short explanation of the result.

Imagine you have a graph on $V$ vertices. In that graph, each vertex is connected to a bunch of others. Call $p$ the number corresponding to the connectivity of the least connected variable in the graph. Then this paper proves that you can define a set of real vectors in dimension $V - p$ where non-adjacent nodes in the graph can be assigned orthogonal vectors.

So what we want to do — find a random dictionary $\hat{D}$ such that $\vec{d}_j \cdot \vec{d}_m = 0$ for all $k, m$ not connected in hardware — can be done if the length of the vectors $\vec{d}$ is greater than $V - p$.

For Vesuvius, the number $V$ is 512, and the lowest connectivity node in a Chimera graph is $p = 5$. So as long as the dimension of the dictionary atoms is greater than 512 – 5 = 507, we can always perform Step 1.

Here is a little more color on this very interesting result. Imagine you have to come up with two vectors $\vec{g}$ and $\vec{h}$ that are orthogonal (the dot product $\vec{g} \cdot \vec{h}$ is zero). What’s the minimum dimension these vectors have to live in such that this can be done? Well imagine that they both live in one dimension — they are just numbers on a line. Then clearly you can’t do it. However if you have two dimensions, you can. Here’s an example: $\vec{g} = \hat{x}$ and $\vec{h} = \hat{y}$. If you have more that two dimensions, you can also, and the choices you make in this case are not unique.

More generally, if you ask the question “how many orthogonal vectors can I draw in an $V$-dimensional space?”, the answer is $V$ — one vector per dimension. So that is a key piece of the above result. If we had a graph with $V$ vertices where NONE of the vertices were connected to any others (minimum vertex connectivity $p = 0$), and want to assign vectors to each vertex such that all of these vectors are orthogonal to all the others, that’s equivalent to asking “given a $V$-dimensional space, what’s the minimum dimension of a set of vectors such that they are all orthogonal to each other?”, and the answer is $V$.

Now imagine we start drawing edges between some of the vertices in the graph, and we don’t require that the vectors living on these vertices be orthogonal. Conceptually you can think of this as relaxing some constraints, and making it ‘easier’ to find the desired set of vectors — so the minimum dimension of the vectors required so that this will work is reduced as the graph gets more connected. The fascinating result here is the very simple way this works. Just find the lowest connectivity node in the graph, call its connectivity $p$, and then ask “given a graph on $V$ vertices, where the minimum connectivity vertex has connectivity $p$, what’s the minimum dimension of a set of vectors such that non-connected vertices in the graph are all assigned orthogonal vectors?”. The answer is $V - p$.

Null Space

Now just knowing we can do it isn’t enough. But thankfully it’s not hard to think of a constructive procedure to do this. Here is one:

1. Generate a matrix $\hat{D}$ where all entries are random numbers between +1 and -1.
2. Renormalize each column such that each column’s norm is one.
3. For each column in $\hat{D}$ from the leftmost to the rightmost in order, compute the null space of that column, and then replace that column with a random column written in the null space basis.

If you do this you will get an initial random orthonormal basis as required in our new procedure.

By the way, here is some Python code for computing a null space basis for a matrix $\hat{A}$. It’s easy but there isn’t a native function in numpy or scipy that does it.

import numpy
from scipy.linalg import qr

def nullspace_qr(A):

A = numpy.atleast_2d(A)
Q, R = qr(A.T)
ns = Q[:, R.shape[1]:].conj()
return ns

OK so step 1 wasn’t too bad! Now we have to deal with step 3. This is a harder problem, which I’ll tackle in the next post.

# Sparse coding on D-Wave hardware: things that don’t work

Ice, ice baby.

For Christmas this year, my dad bought me a book called Endurance: Shackleton’s Incredible Voyage, by Alfred Lansing.

It is a true story about folks who survive incredible hardship for a long time. You should read it.

Shackleton’s family motto was Fortitudine Vincimus — “by endurance we conquer”. I like this a lot.

On April 22nd, we celebrate the 14th anniversary of the incorporation of D-Wave. Over these past 14 years, nearly everything we’ve tried hasn’t worked. While we haven’t had to eat penguin (yet), and to my knowledge no amputations have been necessary, it hasn’t been a walk in the park. The first ten things you think of always turn out to be dead ends or won’t work for some reason or other.

Here I’m going to share an example of this with the sparse coding problem by describing two things we tried that didn’t work, and why.

Where we got to last time

In the last post, we boiled down the hardness of L0-norm sparse coding to the solution of a large number of QUBOs of the form

Find $\vec{w}$ that minimizes

$G(\vec{w}; \lambda) = \sum_{j=1}^{K} w_j [ \lambda + \vec{d}_j \cdot (\vec{d}_j -2 \vec{z}) ] + 2 \sum_{j \leq m}^K w_j w_m \vec{d}_j \cdot \vec{d}_m$

I then showed that using this form has advantages (at least for getting a maximally sparse encoding of MNIST) over the more typical L1-norm version of sparse coding.

I also mentioned that we used a variant of tabu search to solve these QUBOs. Here I’m going to outline two strategies we tried to use the hardware to beat tabu that ended up not working.

These QUBOs are fully connected, and the hardware isn’t

The terms in the QUBO that connect variables $j$ and $m$ are proportional to the dot product of the $j^{th}$ and $m^{th}$ dictionary atoms $\vec{d}_j$ and $\vec{d}_m$. Because we haven’t added any restrictions on what these atoms need to look like, these dot products can all be non-zero (the dictionary atoms don’t need to be, and in general won’t be, orthogonal). This means that the problems generated by the procedure are all fully connected — each variable is influenced by every other variable.

Unfortunately, when you build a physical quantum computing chip, this full connectivity can’t be achieved. The chip you get to work with connects any given variable with only a small number of other variables.

There are two ways we know of to get around the mismatch of the connectivity of a problem we want to solve, and the connectivity of the hardware. The first is called embedding, and the second is by using the hardware to perform a type of large neighborhood local search as a component of a hybrid algorithm we call BlackBox.

Solving problems by embedding

In a quantum computer, qubits are physically connected to only some of the other qubits. In the most recent spin of our design, each qubit is connected to at most 6 other qubits in a specific pattern which we call a Chimera graph. In our first product chip, Rainier, there were 128 qubits. In the current processor, Vesuvius, there are 512.

Chimera graphs are a way to use a regular repeating pattern to tile out a processor. In Rainier, the processor graph was a four by four tiling of an eight qubit unit cell. For Vesuvius, the same unit cell was used, but with an eight by eight tiling.

For a detailed overview of the rationale behind embedding, and how it works in practice for Chimera graphs, see here and here, which discuss embedding into the 128-qubit Rainier graph (Vesuvius is the same, just more qubits).

The short version is that an embedding is a map from the variables of the problem you wish to solve to the physical qubits in a processor, where the map can be one-to-many (each variable can be mapped to many physical qubits). To preserve the problem structure we strongly ‘lock together’ qubits corresponding to the same variable.

In the case of fully connected QUBOs like the ones we have here, it is known that you can always embed a fully connected graph with $K$ vertices into a Chimera graph with $(K-1)^2/2$ physical qubits — Rainier can embed a fully connected 17 variable graph, while Vesuvius can embed a fully connected 33 variable graph. Shown to the right is an embedding from this paper into Rainier, for solving a problem that computes Ramsey numbers. The processor graph where qubits colored the same represent the same computational variable.

So one way we could use Vesuvius to solve the sparse coding QUBOs is to restrict $K$ to be 33 or less and embed these problems. However this is unsatisfactory for two (related) reasons. The first is that 33 dictionary atoms isn’t enough for what we ultimately want to do (sparse coding on big data sets). The second is that QUBOs generated by the procedure I’ve described are really easy for tabu search at that scale. For problems this small, tabu gives excellent performance with a per problem timeout of about 10 milliseconds (about the same as the runtime for a single problem on Vesuvius), and since it can be run in the cloud, we can take advantage of massive parallellism as well. So even though on a problem by problem basis, Vesuvius is competitive at this scale, when you gang up say 1,000 cores against it, Vesuvius loses (because there aren’t a thousand of them available… yet :-) ).

So this option, while we can do it, is out. At the stage we’re at now this approach can’t compete with cloud-enabled tabu. Maybe when we have a lot more qubits.

Solving sparse coding QUBOs using BlackBox

BlackBox is an algorithm developed at D-Wave. Here is a high level introduction to how it works. It is designed to solve problems where all we’re given is a black box that converts possible answers to binary optimization problems into real numbers denoting how good those possible answers are. For example, the configuration of an airplane wing could be specified as a bit string, and to know how ‘good’ that configuration was, we might need to actually construct that example and put it in a wind tunnel and measure it. Or maybe just doing a large-scale supercomputer simulation is enough. But the relationship between the settings of the binary variables and the quality of the answer in problems like this is not easily specified in a closed form, like we were able to do with the sparse coding QUBOs.

BlackBox is based on tabu search, but uses the hardware to generate a model of the objective function around each search point that expands possibilities for next moves beyond single bit flips. This modelling and sampling from hardware at each tabu step increases the time per step, but decreases the number of steps required to reach some target value of the objective function. As the cost of evaluating the objective function goes up, the gain in making fewer ‘steps’ by making better moves at each tabu step goes up. However if the objective function can be very quickly evaluated, tabu generally beats BlackBox because it can make many more guesses per unit time because of the additional cost of the BlackBox modeling and hardware sampling step.

BlackBox can be applied to arbitrary sized fully connected QUBOs, and because of this is better than embedding because we lose the restriction to small numbers of dictionary atoms. With BlackBox we can try any size problem and see how it does.

We did this, and unfortunately BlackBox on Vesuvius is not competitive with cloud-enabled tabu search for any of the problem sizes we tried (which were, admittedly, still pretty small — up to 50 variables). I suspect that this will continue to hold, no matter how large these problems get, for the following reasons:

1. The inherently parallel nature of the sparse coding problem ($S$ independent QUBOs) means that we will always be up against multiple cores vs. a small number of Vesuvius processors. This factor can be significant — for a large problem with millions of data objects, this factor can easily be in the thousands or tens of thousands.
2. BlackBox is designed for objective functions that are really black boxes, so that there is no obvious way to attack the structure of the problem directly, and where it is very expensive to evaluate the objective function. This is not the case for these problems — they are QUBOs and this means that attacks can be made directly based on this known fact. For these problems, the current version of BlackBox, while it can certainly be used, is not in its sweet spot, and wouldn’t be expected to be competitive with tabu in the cloud.

And this is exactly what we find — BlackBox on Vesuvius is not competitive with tabu on the cloud for any of the problem sizes we tried. Note that there is a small caveat here — it is possible (although I think unlikely) that for very large numbers of atoms (say low thousands) this could change, and BlackBox could start winning. However for both of the reasons listed above I would bet against this.

What to do, what to do

We tried both obvious tactics for using our gear to solve these problems, and both lost to a superior classical approach. So do we give up and go home? Of course not!

We shall go on to the end… we shall never surrender!!!

We just need to do some mental gymnastics here and be creative.

In both of the approaches above, we tried to shoehorn the problem our application generates into the hardware. Neither solution was effective.

So let’s look at this from a different perspective. Is it possible to restrict the problems generated by sparse coding so that they exactly fit in hardware — so that we require the problems generated to exactly match the hardware graph? If we can achieve this, we may be able to beat the classical competition, as we know that Vesuvius is many orders of magnitude faster than anything that exists on earth for the native problems it’s solving.

# Sparse coding on D-Wave hardware: setting up the problem

Sparse coding is a very interesting idea that we’ve been experimenting with. It is a way to find ‘maximally repeating patterns’ in data, and use these as a basis for representing that data.

Some of what’s going to follow here is quite technical. However these are beautiful ideas. They are probably related to how human perception and cognition functions. I think sparse coding is much more interesting and important than, say, the Higgs Boson.

Everything starts from data

Twenty five data objects. Each is a 28×28 pixel greyscale image.

Sparse coding requires data. You can think of ‘data’ as a (usually large) set of objects, where the objects each can be represented by a list of real numbers. As as example, we’ll use the somewhat pathological MNIST handwritten digits data set. But you can use any dataset you can imagine. Each data object in this set is a small (28×28 pixel) greyscale image of a handwritten digit. A 28×28 pixel greyscale image can be represented by 28×28 = 784 numbers, each in the range 0..255. The training set has 60,000 of these.

We can represent the entire data set using a two-dimensional array, where there are 60,000 columns (one per image), and 784 rows (one for each pixel). Let’s call this array ${Z}_0$.

Technical detail: this thing is bigger than it has to be

What the first few MNIST images look like, keeping an increasingly large number of SVD modes.

One thing you may notice about MNIST is that each image kind of looks mostly similar. In fact we can exploit this to get a quick compression of the data. The trick we will use is called Singular Value Decomposition (SVD). SVD quickly finds a representation of the data that allows you to reduce its dimensionality. In the case of MNIST, it turns out that instead of using 784 pixel values, we can get away with using around 30 SVD modes instead, with only a small degradation in image quality. If we perform an SVD on ${Z}_0$ and keep only the parts corresponding to the largest 30 singular values, we get a new matrix ${Z}$ which still has 60,000 columns, but only 30 rows. Let’s call the $s^{th}$ column vector $\vec{z}_s$, which stores the $s^{th}$ compressed image. This trick, and others related to it, can always be used to pre-process any raw data we have.

The Dictionary

Let’s now create a small number — let’s say 32 — of very special images. The exact number of these actually matters quite a bit (it’s an example of something called a hyperparameter, many of which lurk in machine learning algorithms and are generally a giant pain in the ass), but the most important thing is that it should be larger than the dimension of the data (which in this case is 30). When this is the case, it’s called an overcomplete basis, which is good for reasons you can read about here.

These images will be in the exact same format as the data we’re learning over. We’ll put these in a new two dimensional array, which will have 32 columns (one for each image) and 30 rows (the same as the post-SVD compressed images above). We’ll call this array $\hat{D}$ the dictionary. Its columns are dictionary atoms. The $j^{th}$ dictionary atom is the column vector $\vec{d}_j$. These dictionary atoms will store the ‘maximally repeating patterns’ in our data that are what we’re looking for.

At this stage, we don’t know what they are — we need to learn them from the data.

The sparse coding problem

Now that we have a data array, and a placeholder array for our dictionary, we are ready to state the sparse coding problem.

Now to see how all this works, imagine we want to reconstruct an image (say $\hat{z}_s$) with our dictionary atoms. Imagine we can only either include or exclude each atom, and the reconstruction is a linear combination of the atoms. Furthermore, we want the reconstruction to be sparse, which means we only want to turn on a small number of our atoms. We can formalize this by asking for the solution of an optimization problem.

L0-norm sparse coding

Define a vector of binary (0/1) numbers $\vec{w}$ of length 32. Now solve this problem:

Find $\vec{w}$ and $\hat{D} = [\vec{d}_1 \vec{d}_2... \vec{d}_{31} \vec{d}_{32}]$ (remember the vectors $\vec{d}_k$ are column vectors in the matrix $\hat{D}$) that minimize

$G(\vec{w}, \hat{D} ; \lambda) = || \vec{z}_s - \sum_{k=1}^{32} w_k \vec{d}_k ||^2 + \lambda \sum_{k=1}^{32} w_k$

The real number $\lambda$ is called a regularization parameter (another one of those hyperparameters). The larger this number is, the bigger the penalty for adding more dictionary atoms to the reconstruction — the rightmost term counts the number of atoms used in the reconstruction. The first term is a measure of the reconstruction error. Minimizing this means minimizing the distance between the data (sometimes called the ground truth) $\vec{z}_s$ and the reconstruction of the image $\sum_{k=1}^{32} w_k \vec{d}_k$.

[Note that this prescription for sparse coding is different that the one typically used, where the weights $\vec{w}$ are real valued and the regularization term is of the L1 (sum over absolute values of the weights) form.]

You may see a simple way to globally optimize this. All you have to do is set $\vec{d}_1 = \vec{z}_s$, $\vec{d}_k = 0$ for all the other k, $w_1 = 1$ and $w_k = 0$ for all the other k — in other words, store the image in one of the dictionary atoms and only turn that one on to reconstruct. Then the reconstruction is perfect, and you only used one dictionary atom to do it!

OK well that’s useless. But now say we sum over all the images in our data set (in the case of MNIST this is 60,000 images). Now instead of the number of images being less than the number of dictionary atoms, it’s much, much larger. The trick of just ‘memorizing’ the images in the dictionary won’t work any more.

Now over all the data

Let’s call the total number of data objects $S$. In our MNIST case $S = 60,000$. We now need to find $\hat{W}$ (this is a 32xS array) and $\hat{D}$ that minimize

$G(\hat{W}, \hat{D} ; \lambda) = \sum_{s=1}^S || \vec{z}_{s} - \sum_{k=1}^{32} w_{ks} \vec{d}_k ||^2 + \lambda \sum_{s=1}^S \sum_{k=1}^{32} w_{ks}$

This is the full sparse coding prescription — in the next post I’ll describe how to solve it, what the results look like, and introduce a bunch of ways to make good use of the dictionary we’ve just learned!

# Quantum computing for learning: Shaping reality both figuratively and literally

I’d like to document in snippets and thought-trains a little more of the story behind how my co-workers and I are trying to apply quantum computing to the field of intelligence and learning. I honestly think that this is the most fascinating and cool job in the world. The field of Artificial Intelligence (AI) – after a period of slowdown – is now once again exploding with possibility. Big data, large-scale learning, deep networks, high performance computing, bio-inspired architectures… There have been so many advancements lately that it’s kind of hard to keep up! Similarly, the work being done on quantum information processing here at D-Wave is ushering in a new computational revolution. So being a multi-disciplinary type and somewhat masochistic, I find it exciting to explore just how far we can take the union of these two fields.

Approximately 5 years ago, while working on my PhD at University, I started to have an interest in whether or not quantum computing could be used to process information in a brain-like way. I’m trying to remember where this crazy obsession came from. I’d been interested in brains and artificial intelligence for a while, and done a bit of reading on neural nets. Probably because one of my friends was doing a PhD that involved robots and I always thought it sounded super-cool. But I think that it was thinking about Josephson junctions that really got me wondering. Josephson junctions are basically switches. But they kind of have unusual ways of switching (sometimes when you don’t want them to). And because of this, I always thought that Josephson junctions are a bit like neurons. So I started searching the literature to find ways in which researchers had used these little artificial neurons to compute. And surprisingly, I found very little. There were some papers about how networks of Josephson junctions could be used to simulate neurons, but no-one had actually built anything substantial. I wrote a bit about this in a couple of old posts (from Physics and Cake blog):

I’d read about the D-Wave architecture and I’d been following the company’s progress for some time. After reading a little about the promise of Josephson junction networks, and the pitfalls of the endeavour (mostly because making the circuits reproducible is extremely difficult), I then began wondering whether or not the D-Wave processor could be used in this way. It’s a network of qubits made from Josephson junctions after all, and they’re connected together so that they talk to each other. Yeah, kind of like neurons really. Isn’t that funny. And hey, those D-Wave types have spent 8 years getting that network of Josephson junctions to behave itself. Getting it to be programmable, addressable, robust, and scalable. Hmm, scalable…. I particularly like that last one. Brains are like, big. Lotsa connections. And also, I thought to myself (probably over tea and cake), if the neurons are qubits, doesn’t that mean you can put them in superposition and entangled states? What would that even mean? Boy, that sounds cool. Maybe they would process information differently, and maybe they could even learn faster if they could be in combinations of states at the same time and … could you build a small one and try it out?

The train of thought continued.

From quantum physics to quantum brains

That was before I joined D-Wave. Upon joining the company, I got to work applying some of my physics knowledge to helping build and test the processors themselves. However there was a little part of me that still wanted to actually find ways to use them. Not too long after I had joined the company there happened to be a competition run internally at D-Wave known as ‘Apps Day’, open to everyone in the company, where people were encouraged to try to write an app for the quantum computer. Each candidate got to give a short presentation describing their app, and there were prizes at stake.

I decided to try and write an app that would allow the quantum computer to learn how to play the board game Go. It was called QUAGGA, named after an extinct species of zebra. As with similar attempts involving the ill-fated zebra, I too might one day try to resurrect my genetically-inferior code. Of course this depends on whether or not I ever understand the rules of Go well enough to program it properly :) Anyway… back to Apps Day. There were several entries and I won a runner-up prize (my QUAGGA app idea was good even though I hadn’t actually finished coding it or run it on the hardware). But the experience got me excited and I wanted to find out more about how I could apply quantum processing to applications, especially those in the area of machine learning and AI.

That’s why I moved from physics into applications development.

Since then the team I joined has been looking into applying quantum technology to various areas of machine learning, in a bid to unite two fields which I have a really strong feeling are made for each other. I’ve tried to analyse where this hunch originates from. The best way to describe it is that I really want to create models of machine intelligence and creativity that are bio-inspired. To do that I believe that you have to take inspiration from the mammalian brain, such as its highly parallel, hierarchical arrangement of substructures. And I just couldn’t help but keep thinking: D-Wave’s processors are highly parallel systems with qubits that can be in one of two states (similar to firing or not firing neurons) with connections between them that can be inhibitory or excitory. Moreover, like the brain, these systems, are INCREDIBLY energy efficient because they are designed to do parallel processing. Modern CPUs are not – hence why brain simulations and machine learning programs take so much energy and require huge computer clusters to run. I believe we need to explore many different hardware and software architectures if we want to get smarter about intelligent computing and closer to the way our own minds work. Quantum circuits are a great candidate in that hunt for cool brain-like processing in silicon.

So what on earth happened here? I’d actually found a link between my two areas of obsession interest and ended up working on some strange joint project that combined the best of both worlds. Could this be for real? I kept thinking that maybe I wanted to believe so badly I was just seeing the machine-learning messiah in a piece of quantum toast. However, even when I strive to be truly objective, I still find a lot of evidence that the results of this endeavour could be very fruitful.

Our deep and ever-increasing understanding of physics (including quantum mechanics) is allowing us to harness and shape the reality of the universe to create new types of brains. This is super-cool. However, the thing I find even cooler is that if you work hard enough at something, you may discover that several fascinating areas are related in a deeper way than you previously understood. Using this knowledge, you can shape the reality of your own life to create a new, hybrid project idea to work on; one which combines all the things you love doing.