# Sparse coding on D-Wave hardware: structured dictionaries

The underlying problem we saw last time, which prevented us from using the hardware to compete with tabu search in the cloud, was the mismatch between the connectivity of the problems sparse coding generates (they are fully connected) and the connectivity of the hardware.

The source of this mismatch is the quadratic term in the objective function: the coupling between the $j^{th}$ and $m^{th}$ variables is proportional to the dot product $\vec{d}_j \cdot \vec{d}_m$ of the corresponding dictionary atoms.

Here’s an idea. What if we demand that $\vec{d}_j \cdot \vec{d}_m$ has to be zero for all pairs of variables $j$ and $m$ that are not connected in hardware? If we can achieve this structure in the dictionary, we get a very interesting result. Instead of being fully connected, the QUBOs with this restriction can be engineered to exactly match the underlying problem the hardware solves. If we can do this, we get closer to using the full power of the hardware.

## L0-norm sparse coding with structured dictionaries

Here is the formal setup.

Given

1. A set of $S$ data objects $\vec{z}_s$, where each $\vec{z}_s$ is a real valued vector with $N$ components;
2. An $N \times K$ real valued matrix $\hat{D}$, where $K$ is the number of dictionary atoms we choose, and we define its $k^{th}$ column to be the vector $\vec{d}_k$;
3. A $K \times S$ binary valued matrix $\hat{W}$;
4. And a real number $\lambda$, which is called the regularization parameter,

Find $\hat{W}$ and $\hat{D}$ that minimize

$G(\hat{W}, \hat{D} ; \lambda) = \sum_{s=1}^S || \vec{z}_{s} - \sum_{k=1}^{K} w_{ks} \vec{d}_k ||^2 + \lambda \sum_{s=1}^S \sum_{k=1}^{K} w_{ks}$

subject to the constraints that $\vec{d}_j \cdot \vec{d}_m = 0$ for all pairs $j,m$ that are not connected in the quantum chip being used.

The only difference here from what we did before is the last sentence, where we add a set of constraints on the dictionary atoms.
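For reference, $G$ is cheap to evaluate classically. Here is a minimal sketch (function and variable names are mine), with the data objects stored as the columns of a matrix $Z$:

```python
import numpy as np

def objective(Z, D, W, lam):
    """Evaluate G(W, D; lambda) = sum_s ||z_s - D w_s||^2 + lambda * sum(W).

    Z: (N, S) data matrix, columns are the z_s.
    D: (N, K) dictionary, columns are the atoms d_k.
    W: (K, S) binary weight matrix.
    """
    residual = Z - D @ W  # reconstruction error for all S objects at once
    return np.sum(residual ** 2) + lam * np.sum(W)
```

Note that the constraints on $\hat{D}$ don't appear here; they only restrict which dictionaries are admissible.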

## Solving the sparse coding problem using block coordinate descent

We’re going to use the same strategy for solving this as before, with a slight change:

1. First, we generate a random dictionary $\hat{D}$, subject to meeting the orthogonality constraints we’ve imposed on the dictionary atoms.
2. Holding this dictionary fixed, we solve the optimization problem for the weights $\hat{W}$. These optimization problems are now Chimera-structured QUBOs that fit exactly onto the hardware by construction.
3. Now we fix the weights to these values, and find the optimal dictionary $\hat{D}$, again subject to our constraints.

We then iterate steps 2 and 3 until $G$ converges to a minimum.
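Here is a toy, classical sketch of the whole loop (all names mine). The two hard sub-steps are replaced by stand-ins: the weights are found by exhaustive search rather than by the hardware, and the dictionary update is plain least squares, ignoring the orthogonality constraints that the real Step 3 must respect:

```python
import numpy as np
from itertools import product

def sparse_code(Z, K, lam, iters=10, seed=0):
    """Toy block coordinate descent for the L0 sparse coding objective.

    Stand-ins for the real steps: the weights are found by exhaustive
    search over all 2^K binary vectors (fine only for tiny K), and the
    dictionary update is plain least squares -- the real Step 3 must
    also respect the hardware orthogonality constraints.
    """
    rng = np.random.default_rng(seed)
    N, S = Z.shape
    # Step 1: random dictionary with unit-norm atoms
    D = rng.normal(size=(N, K))
    D /= np.linalg.norm(D, axis=0)
    W = np.zeros((K, S))
    for _ in range(iters):
        # Step 2: per data object, minimize ||z_s - D w||^2 + lam * sum(w)
        for s in range(S):
            best = min(
                product([0, 1], repeat=K),
                key=lambda w: np.sum((Z[:, s] - D @ np.array(w)) ** 2) + lam * sum(w),
            )
            W[:, s] = best
        # Step 3 (unconstrained stand-in): least-squares dictionary update
        D = Z @ np.linalg.pinv(W)
    return D, W
```

Each half-step can only lower $G$, so the alternation converges to a (local) minimum.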

Now we’re in a different regime than before — step 2 requires the solution of a large number of Chimera-structured QUBOs, not fully connected QUBOs, which makes those problems much better fits to the hardware. But we now have to do some new things to enable steps 1 and 3, and these new steps come at some cost.

The first of these is not too hard, and introduces a key concept we’ll use for Step 3 (which is harder). In this post I’ll go over how to do Step 1.

## Step 1: Setting up an initial random dictionary that obeys our constraints

Alright, so the first thing we need to do is figure out under what conditions Step 1 can be achieved at all.

There is a very interesting result in a paper by Lovász, Saks and Schrijver called Orthogonal Representations and Connectivity of Graphs. Here is a short explanation of the result.

Imagine you have a graph on $V$ vertices, where each vertex is connected to some of the others. Call $p$ the number of neighbors of the least connected vertex in the graph. Then this paper proves that you can define a set of real vectors in dimension $V - p$ such that non-adjacent vertices in the graph are assigned orthogonal vectors.

So what we want to do — find a random dictionary $\hat{D}$ such that $\vec{d}_j \cdot \vec{d}_m = 0$ for all pairs $j, m$ not connected in hardware — can be done if the dimension of the vectors $\vec{d}$ is at least $V - p$.

For Vesuvius, the number $V$ is 512, and the lowest connectivity node in a Chimera graph has $p = 5$. So as long as the dimension of the dictionary atoms is at least $512 - 5 = 507$, we can always perform Step 1.
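We can sanity-check both numbers by building a Chimera adjacency directly. This is a sketch of the standard $K_{4,4}$ tiling; the qubit indexing convention is mine:

```python
from collections import defaultdict

def chimera_adjacency(m):
    """Adjacency of an m x m Chimera graph of K_{4,4} unit cells (8*m*m qubits).

    Each cell has 4 'vertical' qubits (side 0) and 4 'horizontal' qubits
    (side 1), fully connected to each other within the cell. Vertical qubits
    also couple to the matching qubit in the cell below; horizontal qubits
    to the matching qubit in the cell to the right.
    """
    def qid(r, c, side, k):
        return ((r * m + c) * 2 + side) * 4 + k

    adj = defaultdict(set)
    for r in range(m):
        for c in range(m):
            # complete bipartite couplings inside the cell
            for i in range(4):
                for j in range(4):
                    a, b = qid(r, c, 0, i), qid(r, c, 1, j)
                    adj[a].add(b); adj[b].add(a)
            # vertical qubits couple to the cell below
            if r + 1 < m:
                for k in range(4):
                    a, b = qid(r, c, 0, k), qid(r + 1, c, 0, k)
                    adj[a].add(b); adj[b].add(a)
            # horizontal qubits couple to the cell to the right
            if c + 1 < m:
                for k in range(4):
                    a, b = qid(r, c, 1, k), qid(r, c + 1, 1, k)
                    adj[a].add(b); adj[b].add(a)
    return adj

adj = chimera_adjacency(8)  # Vesuvius-sized: 8 x 8 tiling, 512 qubits
V = len(adj)
p = min(len(nbrs) for nbrs in adj.values())
# Boundary qubits have degree 4 + 1 = 5, interior ones 4 + 2 = 6,
# so V - p = 512 - 5 = 507 is the atom dimension the theorem guarantees.
```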

Here is a little more color on this very interesting result. Imagine you have to come up with two vectors $\vec{g}$ and $\vec{h}$ that are orthogonal (the dot product $\vec{g} \cdot \vec{h}$ is zero). What’s the minimum dimension these vectors have to live in such that this can be done? Well, imagine that they both live in one dimension — they are just numbers on a line. Then clearly you can’t do it. However, if you have two dimensions, you can. Here’s an example: $\vec{g} = \hat{x}$ and $\vec{h} = \hat{y}$. If you have more than two dimensions you can do it as well, and in that case the choices you make are not unique.

More generally, if you ask the question “how many mutually orthogonal vectors can I draw in a $V$-dimensional space?”, the answer is $V$ — one vector per dimension. That is a key piece of the above result. If we had a graph with $V$ vertices where NONE of the vertices were connected to any others (minimum vertex connectivity $p = 0$), and we want to assign vectors to the vertices such that all of these vectors are orthogonal to each other, then we need $V$ mutually orthogonal vectors, and the minimum dimension in which that can be done is $V$ — consistent with $V - p$ when $p = 0$.

Now imagine we start drawing edges between some of the vertices in the graph, and we don’t require that the vectors living on these vertices be orthogonal. Conceptually you can think of this as relaxing some constraints, and making it ‘easier’ to find the desired set of vectors — so the minimum dimension of the vectors required so that this will work is reduced as the graph gets more connected. The fascinating result here is the very simple way this works. Just find the lowest connectivity node in the graph, call its connectivity $p$, and then ask “given a graph on $V$ vertices, where the minimum connectivity vertex has connectivity $p$, what’s the minimum dimension of a set of vectors such that non-connected vertices in the graph are all assigned orthogonal vectors?”. The answer is $V - p$.

## Null Space

Now just knowing we can do it isn’t enough. But thankfully it’s not hard to think of a constructive procedure to do this. Here is one:

1. Generate a matrix $\hat{D}$ where all entries are random numbers between +1 and -1.
2. Renormalize each column such that each column’s norm is one.
3. For each column in $\hat{D}$ after the first, from left to right, compute the null space of the matrix formed by the columns to its left, and then replace that column with a random unit vector written in that null space basis.

If you do this you will get an initial random set of mutually orthonormal atoms, as required in our new procedure.

By the way, here is some Python code for computing a null space basis for a matrix $\hat{A}$. It’s easy but there isn’t a native function in numpy or scipy that does it.

```python
import numpy
from scipy.linalg import qr

def nullspace_qr(A):
    """Orthonormal basis for the null space of A (assumes A has full row rank)."""
    A = numpy.atleast_2d(A)
    # QR-factorize A^T: the trailing columns of Q are orthogonal to every row of A
    Q, R = qr(A.T)
    ns = Q[:, R.shape[1]:].conj()
    return ns
```
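Putting the three steps together in pure numpy (names mine): this version makes all the atoms mutually orthonormal, which is stricter than the hardware constraints require, but certainly satisfies them.

```python
import numpy as np

def random_orthonormal_dictionary(N, K, seed=0):
    """Steps 1-3 above: random columns, normalized, then each column is
    replaced by a random vector drawn from the null space of the columns
    to its left, so all K atoms end up mutually orthonormal (needs K <= N)."""
    rng = np.random.default_rng(seed)
    D = rng.uniform(-1.0, 1.0, size=(N, K))
    D /= np.linalg.norm(D, axis=0)
    for k in range(1, K):
        # orthonormal basis for the null space of the first k columns,
        # via the trailing columns of a full QR factorization
        Q, _ = np.linalg.qr(D[:, :k], mode='complete')
        ns = Q[:, k:]
        v = ns @ rng.normal(size=ns.shape[1])  # random vector in the null space
        D[:, k] = v / np.linalg.norm(v)
    return D
```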

OK so step 1 wasn’t too bad! Now we have to deal with step 3. This is a harder problem, which I’ll tackle in the next post.

# Sparse coding on D-Wave hardware: things that don’t work

Ice, ice baby.

For Christmas this year, my dad bought me a book called Endurance: Shackleton’s Incredible Voyage, by Alfred Lansing.

It is a true story about folks who survive incredible hardship for a long time. You should read it.

Shackleton’s family motto was Fortitudine Vincimus — “by endurance we conquer”. I like this a lot.

On April 22nd, we celebrate the 14th anniversary of the incorporation of D-Wave. Over these past 14 years, nearly everything we’ve tried hasn’t worked. While we haven’t had to eat penguin (yet), and to my knowledge no amputations have been necessary, it hasn’t been a walk in the park. The first ten things you think of always turn out to be dead ends or won’t work for some reason or other.

Here I’m going to share an example of this with the sparse coding problem by describing two things we tried that didn’t work, and why.

## Where we got to last time

In the last post, we boiled down the hardness of L0-norm sparse coding to the solution of a large number of QUBOs of the form

Find $\vec{w}$ that minimizes

$G(\vec{w}; \lambda) = \sum_{j=1}^{K} w_j [ \lambda + \vec{d}_j \cdot (\vec{d}_j -2 \vec{z}) ] + 2 \sum_{j < m}^K w_j w_m \vec{d}_j \cdot \vec{d}_m$
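To make the form concrete, here is a sketch (names mine) that builds the linear and coupling coefficients for a single data object. It drops the constant $\vec{z} \cdot \vec{z}$, which shifts the objective but doesn’t change the minimizer:

```python
import numpy as np

def qubo_coefficients(D, z, lam):
    """Coefficients of the QUBO above for one data object z.

    Returns (a, B): linear terms a_j = lam + d_j . (d_j - 2 z), and
    couplings B[j, m] = 2 d_j . d_m for j < m (upper triangle only).
    """
    G = D.T @ D  # Gram matrix of the dictionary atoms
    a = lam + np.diag(G) - 2.0 * (D.T @ z)
    B = 2.0 * np.triu(G, k=1)
    return a, B

def qubo_value(a, B, w):
    """G(w; lam) minus the constant z.z, for a binary vector w."""
    w = np.asarray(w, dtype=float)
    return a @ w + w @ B @ w
```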

I then showed that using this form has advantages (at least for getting a maximally sparse encoding of MNIST) over the more typical L1-norm version of sparse coding.

I also mentioned that we used a variant of tabu search to solve these QUBOs. Here I’m going to outline two strategies we tried to use the hardware to beat tabu that ended up not working.
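For reference, here is a minimal single-bit-flip tabu search for QUBOs of this kind. It is a toy stand-in, not D-Wave’s actual variant: at each step it flips the best non-tabu bit (with the usual aspiration rule), keeping recently flipped bits tabu for a few steps:

```python
import numpy as np

def tabu_qubo(a, B, iters=500, tenure=5, seed=0):
    """Toy tabu search for min_w a.w + w'Bw over binary w.

    a: linear coefficients (length K); B: strictly upper triangular couplings.
    """
    rng = np.random.default_rng(seed)
    K = len(a)
    w = rng.integers(0, 2, size=K)

    def energy(w):
        return a @ w + w @ B @ w

    best_w, best_e = w.copy(), energy(w)
    tabu_until = np.zeros(K, dtype=int)  # step until which each bit is tabu
    for t in range(iters):
        cand_e, cand_j = None, None
        for j in range(K):
            w[j] ^= 1
            e = energy(w)
            w[j] ^= 1
            # aspiration: a tabu flip is allowed if it yields a new best
            allowed = t >= tabu_until[j] or e < best_e
            if allowed and (cand_e is None or e < cand_e):
                cand_e, cand_j = e, j
        if cand_j is None:
            continue
        w[cand_j] ^= 1  # move even if the best move is uphill
        tabu_until[cand_j] = t + tenure
        if cand_e < best_e:
            best_w, best_e = w.copy(), cand_e
    return best_w, best_e
```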

## These QUBOs are fully connected, and the hardware isn’t

The terms in the QUBO that connect variables $j$ and $m$ are proportional to the dot product of the $j^{th}$ and $m^{th}$ dictionary atoms $\vec{d}_j$ and $\vec{d}_m$. Because we haven’t added any restrictions on what these atoms need to look like, these dot products can all be non-zero (the dictionary atoms don’t need to be, and in general won’t be, orthogonal). This means that the problems generated by the procedure are all fully connected — each variable is influenced by every other variable.

Unfortunately, when you build a physical quantum computing chip, this full connectivity can’t be achieved. The chip you get to work with connects any given variable with only a small number of other variables.

There are two ways we know of to get around the mismatch between the connectivity of a problem we want to solve and the connectivity of the hardware. The first is called embedding; the second is using the hardware to perform a type of large neighborhood local search as a component of a hybrid algorithm we call BlackBox.

## Solving problems by embedding

In a quantum computer, qubits are physically connected to only some of the other qubits. In the most recent spin of our design, each qubit is connected to at most 6 other qubits in a specific pattern which we call a Chimera graph. In our first product chip, Rainier, there were 128 qubits. In the current processor, Vesuvius, there are 512.

Chimera graphs are a way to use a regular repeating pattern to tile out a processor. In Rainier, the processor graph was a four by four tiling of an eight qubit unit cell. For Vesuvius, the same unit cell was used, but with an eight by eight tiling.

For a detailed overview of the rationale behind embedding, and how it works in practice for Chimera graphs, see here and here, which discuss embedding into the 128-qubit Rainier graph (Vesuvius is the same, just more qubits).

The short version is that an embedding is a map from the variables of the problem you wish to solve to the physical qubits in a processor, where the map can be one-to-many (each variable can be mapped to many physical qubits). To preserve the problem structure we strongly ‘lock together’ qubits corresponding to the same variable.

In the case of fully connected QUBOs like the ones we have here, it is known that you can always embed a fully connected graph with $K$ vertices into a Chimera graph with $(K-1)^2/2$ physical qubits — Rainier can embed a fully connected 17-variable graph, while Vesuvius can embed a fully connected 33-variable graph. Shown to the right is an embedding from this paper into Rainier, for solving a problem that computes Ramsey numbers; in the processor graph, qubits colored the same represent the same computational variable.
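A quick sanity check of the $(K-1)^2/2$ formula (helper name mine):

```python
def clique_embed_qubits(K):
    """Physical qubits needed to embed the complete graph on K variables
    into Chimera, using the (K-1)^2 / 2 scheme quoted above."""
    return (K - 1) ** 2 // 2

# Rainier's 128 qubits fit K = 17; Vesuvius's 512 qubits fit K = 33.
```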

So one way we could use Vesuvius to solve the sparse coding QUBOs is to restrict $K$ to be 33 or less and embed these problems. However, this is unsatisfactory for two (related) reasons. The first is that 33 dictionary atoms isn’t enough for what we ultimately want to do (sparse coding on big data sets). The second is that QUBOs generated by the procedure I’ve described are really easy for tabu search at that scale. For problems this small, tabu gives excellent performance with a per-problem timeout of about 10 milliseconds (about the same as the runtime for a single problem on Vesuvius), and since it can be run in the cloud, we can take advantage of massive parallelism as well. So even though Vesuvius is competitive at this scale on a problem-by-problem basis, when you gang up say 1,000 cores against it, Vesuvius loses (because there aren’t a thousand of them available… yet).

So this option, while we can do it, is out. At the stage we’re at now this approach can’t compete with cloud-enabled tabu. Maybe when we have a lot more qubits.

## Solving sparse coding QUBOs using BlackBox

BlackBox is an algorithm developed at D-Wave. Here is a high level introduction to how it works. It is designed to solve problems where all we’re given is a black box that converts possible answers to binary optimization problems into real numbers denoting how good those possible answers are. For example, the configuration of an airplane wing could be specified as a bit string, and to know how ‘good’ that configuration was, we might need to actually construct that example and put it in a wind tunnel and measure it. Or maybe just doing a large-scale supercomputer simulation is enough. But the relationship between the settings of the binary variables and the quality of the answer in problems like this is not easily specified in a closed form, like we were able to do with the sparse coding QUBOs.

BlackBox is based on tabu search, but uses the hardware to generate a model of the objective function around each search point that expands possibilities for next moves beyond single bit flips. This modelling and sampling from hardware at each tabu step increases the time per step, but decreases the number of steps required to reach some target value of the objective function. As the cost of evaluating the objective function goes up, the gain from making fewer steps via better moves goes up as well. However, if the objective function can be evaluated very quickly, tabu generally beats BlackBox, because without the overhead of the modeling and hardware sampling step it can make many more guesses per unit time.

BlackBox can be applied to fully connected QUBOs of arbitrary size, which makes it better than embedding here: we lose the restriction to small numbers of dictionary atoms. With BlackBox we can try any size of problem and see how it does.

We did this, and unfortunately BlackBox on Vesuvius is not competitive with cloud-enabled tabu search for any of the problem sizes we tried (which were, admittedly, still pretty small — up to 50 variables). I suspect that this will continue to hold, no matter how large these problems get, for the following reasons:

1. The inherently parallel nature of the sparse coding problem ($S$ independent QUBOs) means that we will always be up against multiple cores vs. a small number of Vesuvius processors. This factor can be significant — for a large problem with millions of data objects, this factor can easily be in the thousands or tens of thousands.
2. BlackBox is designed for objective functions that are really black boxes, so that there is no obvious way to attack the structure of the problem directly, and where it is very expensive to evaluate the objective function. This is not the case for these problems — they are QUBOs and this means that attacks can be made directly based on this known fact. For these problems, the current version of BlackBox, while it can certainly be used, is not in its sweet spot, and wouldn’t be expected to be competitive with tabu in the cloud.

There is one small caveat here — it is possible (although I think unlikely) that for very large numbers of atoms (say low thousands) this could change, and BlackBox could start winning. However, for both of the reasons listed above, I would bet against it.

## What to do, what to do

We tried both obvious tactics for using our gear to solve these problems, and both lost to a superior classical approach. So do we give up and go home? Of course not!

We shall go on to the end… we shall never surrender!!!

We just need to do some mental gymnastics here and be creative.

In both of the approaches above, we tried to shoehorn the problem our application generates into the hardware. Neither solution was effective.

So let’s look at this from a different perspective. Is it possible to restrict the problems generated by sparse coding so that they exactly fit in hardware — so that we require the problems generated to exactly match the hardware graph? If we can achieve this, we may be able to beat the classical competition, as we know that Vesuvius is many orders of magnitude faster than anything that exists on earth for the native problems it’s solving.

# It’s like the quantum computer is playing 20 questions…

I’ve been thinking about the BlackBox compiler recently and came up with a very interesting analogy to the way it works. There are actually lots of different ways to think about how BlackBox works, and we’ll post more of them over time, but here is a very high level and fun one.

The main way that you use BlackBox is to supply it with a classical function which computes the “goodness” of a given bitstring by returning a real number (the lower this number, the better the bitstring was).

Whatever your optimization problem is, you need to write a function that encodes your problem into a series of bits (x1, x2, x3, …, xN) to be discovered, and which also computes how “good” a given bitstring (e.g. 0,1,1…0) is. When you pass such a function to BlackBox, the quantum compiler then repeatedly comes up with ideas for bitstrings, and using the information that your function supplies about how good its “guesses” are, it quickly converges on the best bitstring possible.
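The real BlackBox interface isn’t shown here, so the following is a purely hypothetical sketch of the division of labor: a user-supplied objective (here a made-up “wing drag” score), plus a stand-in suggestion loop that uses random guesses where BlackBox would use hardware-assisted modeling:

```python
import random

def wing_drag(bits):
    """Hypothetical objective: pretend each bit toggles a design feature and
    return a 'drag' score (lower is better). Stands in for a wind-tunnel test."""
    target = [1, 0, 1, 1, 0, 0, 1, 0]  # the (unknown) best configuration
    return sum(b != t for b, t in zip(bits, target))

def blackbox_like_solve(objective, n_bits, n_guesses=200, seed=0):
    """Toy stand-in for the BlackBox loop: propose bitstrings, score them with
    the user's objective, keep the best. (The real compiler proposes its
    guesses far more cleverly, using hardware samples of a learned model.)"""
    rng = random.Random(seed)
    best_bits, best_score = None, float('inf')
    for _ in range(n_guesses):
        guess = [rng.randint(0, 1) for _ in range(n_bits)]
        score = objective(guess)
        if score < best_score:
            best_bits, best_score = guess, score
    return best_bits, best_score
```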

So using this approach, the quantum processor behaves as a co-processor to a classical computing resource. The classical computing resource handles one part of the problem (computing the goodness of a given bitstring), and the quantum computer handles the other (suggesting bitstrings). I realized that this is described very nicely as the two computers playing 20 questions with one another.

The quantum computer suggests creative solutions to a problem, and then the classical computer is used to give feedback on how good the suggested solution is. Using this feedback, BlackBox will intelligently suggest a new solution. In the 20 questions analogy, it knows not to waste its next question on something like “Is it a carrot?” once the answers so far have ruled that out.

There is actually a deep philosophical point here. One of the pieces that is missing in the puzzle of artificial intelligence is how to make algorithms and programs more creative. I have always been an advocate of using quantum computing to power AI, but we now start to see concrete ways in which it could really start to address some of the elusive problems that crop up when trying to build intelligent machines.

At D-Wave, we have been starting some initial explorations in the areas of machine creativity and machine dreams, but it is early days and the pieces are only just starting to fall into place.

I was wondering if you could use the QC to actually play 20 questions for real. This is quite a fun application idea. If anyone has any suggestions for how to craft 20 questions into an objective function, let me know. My first two thoughts were to do something with WordNet and NLTK. You could try either a pattern matching or a machine learning version of ‘mining’ WordNet for the right answer. This project would be a little Watson-esque in flavour.

# New tutorials on devPortal: WMIS and MCS

There are two new tutorials on the developer portal, complete with code snippets!

The first tutorial describes how to solve Weighted Maximum Independent Set (WMIS) problems using the hardware. Finding the Maximum Independent Set of a bunch of connected variables can be very useful: at a high level, the MIS gives us information about the largest number of ‘things’ that can be selected from a set when lots of those ‘things’ have conflicting requirements. In the tutorial, an example is given of scheduling events for a sports team, but you can imagine all sorts of variants: train timetabling to improve services, assigning patients to surgeons to maximize the throughput of vital operations and minimize waiting lists, adjusting variable speed limits on motorways to reduce traffic jams during periods of congestion, etc.
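To make the problem itself concrete, here is a brute-force reference implementation of WMIS on a tiny graph (names mine). This illustrates the problem definition only, not the hardware method from the tutorial:

```python
from itertools import combinations

def weighted_mis(weights, edges):
    """Brute-force Weighted Maximum Independent Set on a small graph.

    weights: dict node -> positive weight; edges: iterable of node pairs.
    Returns (best_set, best_weight). Exponential in the number of nodes,
    so only usable for toy instances.
    """
    nodes = list(weights)
    edge_set = {frozenset(e) for e in edges}
    best, best_w = frozenset(), 0.0
    for r in range(len(nodes) + 1):
        for subset in combinations(nodes, r):
            # independent means no edge inside the subset
            if any(frozenset(p) in edge_set for p in combinations(subset, 2)):
                continue
            w = sum(weights[n] for n in subset)
            if w > best_w:
                best, best_w = frozenset(subset), w
    return best, best_w
```

On a path a–b–c, whether the middle node wins depends on the weights, which is exactly what makes the weighted version interesting.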

The second tutorial describes how to find Maximum Common Subgraphs (MCS) given two graphs. The example given in this tutorial is molecule matching: looking for areas where sub-structures in molecules are very similar can give us information about how such molecules behave. This is just one simple example of MCS. You can also imagine the same technique being applied to social networks to look for matches between the structures of social groups; this could be used for improving ad placement or even for detecting crime rings.

These two tutorials are closely linked – as finding the MCS involves finding the MIS as part of the process. There are also lots of interesting applications of both these methods in graph and number theory.

If anyone would like to implement WMIS or MCS to solve any of the problem ideas mentioned in this post, please feel free!

# The dreams of spiritual machines

When I was in middle school, every year we had to select a project to work on. These projects came from a list of acceptable projects. The projects were typical science-ish projects you’d expect a seventh grader to take on. One year my project was about whooping cranes. Not sure why I picked that one. Maybe I thought it might be related to whooping cough.

One year the subject I picked was dreams. What were they? How did they come about? What, if anything, did they tell us about our waking life? I remember being intensely fascinated by the topic at the time, feeling that the answers I was getting to my questions from grown-ups and the encyclopedias checked out from the school library (there was no internet back then, at least in a form I could access) were not satisfactory at all. This was one of my earliest realizations that there were questions no-one yet knew the answers to.

The subject of dreams has come up in my adult life several times, and each time the same questions about them bubble up from my early encounter with them. An acquaintance of mine went through a period of having night terrors, where she would scream so loud that it would wake people in neighboring houses. She described them as being a sense of horror and dread of the most intense and indescribable kind, with sure knowledge that it would never end. This led to multiple 911 calls over periods of years. Several trips to specialists and tests revealed nothing out of the ordinary. Then one day they suddenly stopped. To this day no one has a good explanation for why they started, or why they stopped.

One of my friends has multiple vivid, realistic dreams every night, and he remembers them. They are also often terrifying. I on the other hand rarely dream, or if I do, I don’t remember them.

Recently I have been thinking of dreams again, and I have four computer scientists to thank. One of them is Bill Macready, who is my friend and colleague at D-Wave, and inventor of the framework I’ll introduce shortly. The second is Douglas Hofstadter. The third is Geoff Hinton. The fourth is David Gelernter.

Gelernter is a very interesting guy. Not only is he a rock star computer scientist (Bill Joy called him “one of the most brilliant and visionary computer scientists of our time”), he is also an artist, entrepreneur and writer with an MA in classical literature. He was injured badly opening a package from the Unabomber in 1993. He is the author of several books, but the one I want to focus on now is The Muse in the Machine, which is must-read material for anyone interested in artificial intelligence.

In this book, Gelernter presents a compelling theory of cognition in which emotion, creativity and dreams are central, critically important aspects of the creation of machines that think, feel and act as we do. In this theory, emotion, creativity, analogical thought and even spirituality are viewed as essential to the creation of machines that behave as humans do. I can’t do the book justice in a short post – you should read it.

I am going to pull one quote out of the book though, but before I do I want to briefly touch on what Geoff Hinton has to do with all of this. Hinton is also a rock star in the world of artificial intelligence, and in particular in machine learning. He was one of the inventors of back propagation, and a pioneer in deep belief nets and unsupervised learning. A fascinating demo I really like starts around the 20:00 mark of this video. In this demo, he runs a deep learning system ‘in reverse’, in generative mode. Hinton refers to this process as the system “fantasizing” about the images it’s generating; however Hinton’s fantasizing can also be thought of as the system hallucinating, or even dreaming, about the subjects it has learned. Systems such as these exhibit what I believe to be clear instances of creativity – generating instances of objects that have never existed in the world before, but share some underlying property. In Hinton’s demo, this property is “two-ness”.

Alright, so back to Gelernter, and the quote from The Muse in the Machine:

> A computer that never hallucinates cannot possibly aspire to artificial thought.

While Gelernter speaks a somewhat different language than Hinton, I believe that the property of a machine he is referring to here – the ability to hallucinate, fantasize or dream – is exactly the sort of thing Hinton is doing with his generative digit model. When you run that model, I would argue, you are seeing the faintest wisps of the beginning of true cognition in a machine.

Douglas Hofstadter is probably the most famous of the four computer scientists I’ve been thinking about recently. He is of course the author of Gödel, Escher, Bach, which every self-respecting technophile has read, but more importantly he has long been a proponent of thinking about cognition from a very different perspective than most computer scientists do. For Hofstadter, creativity and analogical reasoning are the key phenomena we need to understand in order to understand our own cognition. Here he is in the “Pattern-finding as the Core of Intelligence” introduction to his Fluid Analogies book:

> In 1977, I began my new career as a professor of computer science, aiming to specialize in the field of artificial intelligence. My goals were modest, at least in number: first, to uncover the secrets of creativity, and second, to uncover the secrets of consciousness, by modeling both phenomena on a computer. Good goals. Not easy.

All four of these folks share a perspective that understanding how analogical thinking and creativity work is an important and under-studied part of building machines like us.

Recently we’ve been working on a series of projects that are aligned with this sort of program. The basic framework is introduced here, in an introductory tutorial.

This basic introduction is extended here.

One of the by-products of this work is a computing system that generates vivid dreamscapes. You can see one of these by following through the Temporal QUFL tutorial.

The technical part of how these dreamscapes are generated is described in these tutorials. I believe these ideas are important. These dreamscapes remind me of H.P. Lovecraft’s Dreamlands, and of this passage from Celephais:

> There are not many persons who know what wonders are opened to them in the stories and visions of their youth; for when as children we learn and dream, we think but half-formed thoughts, and when as men we try to remember, we are dulled and prosaic with the poison of life. But some of us awake in the night with strange phantasms of enchanted hills and gardens, of fountains that sing in the sun, of golden cliffs overhanging murmuring seas, of plains that stretch down to sleeping cities of bronze and stone, and of shadowy companies of heroes that ride caparisoned white horses along the edges of thick forests; and then we know that we have looked back through the ivory gates into that world of wonder which was ours before we were wise and unhappy.

I hope you like them.

# Videos from the NASA Quantum Future Technologies Conference

There are a bunch of cool videos available online from presentations at the recent NASA Quantum Future Technologies conference.

In case you don’t want to watch all the talks (there are a lot!), I’ll point out a few which are most relevant to this blog. If you have time though I’d recommend browsing the full list of talks, as there were lots of extremely interesting themes at the conference!

One of the cool things about this conference was just how much of a focus there was on Adiabatic Quantum Computing. Hopefully the talks will inspire even more people to view this form of quantum computing as scalable, robust and useful for problem-solving.

The talks are available to watch via Adobe Connect.

First is Geordie’s talk on Machine Learning using the D-Wave One system – presenting D-Wave’s latest experimental results on applying the quantum computing system to learning problems such as image compression, image recognition and object detection and tracking:

Hartmut Neven works at Google on image search and computer vision applications. His talk describes how the quantum computing technology at D-Wave is being applied to some challenging problems in this field:

Mohammad Amin from D-Wave describes how noise comes into play in the Adiabatic Quantum Computing model, and explains how the adverse effects of decoherence can be avoided, and do not disrupt the quantum computation when the system is designed correctly:

Frank Gaitan describes how the D-Wave One system can be used to explore a problem in graph theory called Ramsey numbers:

Sergio Boixo describes using the D-Wave One for Adiabatic Quantum Machine Learning and how this relates to the ‘clean energy project’ being performed at USC:

# Quantum computing and light switches

So as part of learning how to become a quantum ninja and program the D-Wave One, it is important to understand the problem that the machine is designed to solve. The D-Wave machine is designed to find the minimum value of a particular mathematical expression, which I can write down in one line:

$E(s_1, \ldots, s_N; h, J) = \sum_{i} h_i s_i + \sum_{i<j} J_{ij} s_i s_j$

As people tend to be put off by mathematical equations in blogposts, I decided to augment it with a picture of a cute cat. However, unless you are very mathematically inclined (like kitty), it might not be intuitive what minimizing this expression actually means, why it is important, or how quantum computing helps. So I’m going to try to answer those three questions in this post.

1.) What does the cat’s expression mean?

The machine is designed to solve discrete optimization problems. What is a discrete optimization problem? It is one where you are trying to find the best settings for a bunch of switches. Here’s a graphical example of what is going on. Let’s imagine that our switches are light switches which each have a ‘bias value’ (a number) associated with them, and they can each be set either ON or OFF:

The light switch game

The game that we must play is to set all the switches into the right configuration. What is the right configuration? It is the one where when we set each of the switches to either ON or OFF (where ON = +1 and OFF = -1) and then we add up all the switches’ bias values multiplied by their settings, we get the lowest answer. This is where the first term in the cat’s expression comes from. The bias values are called h’s and the switch settings are called s’s.

So depending upon which switches we set to +1 and which we set to -1, we will get a different score overall. You can try this game. Hopefully you’ll find it easy because there’s a simple rule to winning:

We find that if we set all the switches with positive biases to OFF and all the switches with negative biases to ON and add up the result then we get the lowest overall value. Easy, right? I can give you as many switches as I want with many different bias values and you just look at each one in turn and flip it either ON or OFF accordingly.
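The rule for the linear-only version of the game is easy to code up. Here's a quick sketch (the bias values are made up for illustration):

```python
# Illustrative bias values for four switches (made up for this example).
h = [2.0, -1.5, 0.3, -0.7]

# The rule from the text: set each switch OPPOSITE to the sign of its bias,
# so every term h_i * s_i contributes -|h_i| to the total.
s = [-1 if h_i > 0 else +1 for h_i in h]

score = sum(h_i * s_i for h_i, s_i in zip(h, s))
print(score)  # -(2.0 + 1.5 + 0.3 + 0.7) = -4.5
```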

OK, let’s make it harder. Now imagine that many of the pairs of switches have an additional rule, one which involves considering PAIRS of switches in addition to just individual switches: for each such pair we add a new bias value (called J) which we multiply by BOTH of the switch settings it connects, and we add the resulting value we get from each pair of switches to our overall number too. Still, all we have to do is decide whether each switch should be ON or OFF subject to this new rule.

But now it is much, much harder to decide whether a switch should be ON or OFF, because its neighbours affect it. Even with the simple example shown with 2 switches in the figure above, you can’t just follow the rule of setting them to be the opposite sign to their bias value anymore (try it!). With a complex web of switches having many neighbours, it quickly becomes very frustrating to try and find the right combination to give you the lowest value overall.
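Here's a tiny made-up two-switch example showing why the couplers break the simple rule: following the sign rule gives one answer, but checking every configuration finds a better one.

```python
from itertools import product

# Made-up example: two switches with one coupler between them.
h = {0: 1.0, 1: -2.0}
J = {(0, 1): -3.0}

def value(s):
    # total = sum of h_i * s_i, plus J_ij * s_i * s_j for each coupled pair
    return sum(h[i] * s[i] for i in h) + sum(Jij * s[i] * s[j] for (i, j), Jij in J.items())

rule = (-1, +1)  # set each switch opposite to the sign of its bias
best = min(product([-1, +1], repeat=2), key=value)

print(value(rule))        # 0.0, the simple rule no longer wins
print(best, value(best))  # (1, 1) -4.0
```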


2.) It’s a math expression – who cares?

We didn’t build a machine to play a strange masochistic light switch game. The concept of finding a good configuration of binary variables (switches) in this way lies at the heart of many problems that are encountered in everyday applications. A few are shown in figure below (click to expand):

Even the idea of doing science itself is an optimization problem (you are trying to find the best ‘configuration’ of terms contributing to a scientific equation which matches our real world observations).

3.) How does quantum mechanics help?

With a couple of switches you can just try every combination of ON’s and OFF’s; there are only four possibilities: [ON ON], [ON OFF], [OFF ON] or [OFF OFF]. But as you add more and more switches, the number of possible ways that the switches can be set grows exponentially:

You can start to see why the game isn’t much fun anymore. In fact it is even difficult for our most powerful supercomputers. Being able to store all those possible configurations in memory, and moving them around inside conventional processors to calculate if our guess is right takes a very, very long time. With only 500 switches, there isn’t enough time in the Universe to check all the configurations.

Quantum mechanics can give us a helping hand with this problem. The fundamental power of a quantum computer comes from the idea that you can put bits of information into a superposition of states. Which means that using a quantum computer, our light switches can be ON and OFF at the same time:

Now let’s consider the same bunch of switches as before, but now held in a quantum computer’s memory:

Because all the light switches are ON and OFF at the same time, we know that the correct answer (the correct ON/OFF setting for each switch) is represented in there somewhere… it is just currently hidden from us.

What the D-Wave quantum computer allows you to do is take this ‘quantum representation’ of your switches and extract the configuration of ONs and OFFs with the lowest value.
Here’s how you do this:

You start with the system in its quantum superposition as described above, and you slowly adjust the quantum computer to turn off the quantum superposition effect. At the same time, you slowly turn up all those bias values (the h and J’s from earlier). As this is performed, you allow the switches to slowly drop out of the superposition and choose one classical state, either ON or OFF. At the end, each switch MUST have chosen to be either ON or OFF. The quantum mechanics working inside the computer helps the light switches settle into the right states to give the lowest overall value when you add them all up at the end. Even though there are 2^N possible configurations it could have ended up in, it finds the lowest one, winning the light switch game.

# The Developer Portal

Keen-eyed readers may have noticed a new section on the D-Wave website entitled ‘developer portal’. Currently the devPortal is being tested within D-Wave; however, we are hoping to open it up to many developers in a staged way within the next year.

We’ve been getting a fair amount of interest from developers around the world already, and we’re anxious to open up the portal so that everyone can have access to the tools needed to start programming quantum computers! However, given that this way of programming is so new, we are also being cautious and testing everything carefully before doing so. In short, it is coming, but you will have to wait just a little longer to get access!

A few tutorials are already available for everyone on the portal. These are intended to give a simple background to programming the quantum systems in advance of the tools coming online. New tutorials will be added to this list over time. If you’d like to have a look you can find them here: DEVELOPER TUTORIALS

In the future we hope that we will be able to grow the community to include competitions and prizes, programming challenges, and large open source projects for people who are itching to make a contribution to the fun world of quantum computer programming.

# Learning to program the D-Wave One: loss function, regularization and some notation

The Loss Function

A loss function is a measure of how well a choice for a set of weak classifiers performs. Let’s start by quantifying how we encode a particular combination of weak classifiers. Given a dictionary with D weak classifiers, define a D-dimensional vector of binary numbers $\vec{w}$.

The elements of $\vec{w}$ that have value 1 represent weak classifiers that are ‘turned on’, and the elements that have value 0 represent weak classifiers that are ‘turned off’. For the choice shown in the figure in the previous post, the vector $\vec{w}$ would have zeroes for all elements except 2, 4, 14, 33, 39 and 45, whose values would be 1. Here are three examples, for the D=52 dictionary defined in Algorithm 2:

$\vec{w} = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]$

This would represent the choice where no weak classifiers were included.

$\vec{w} = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]$

This would represent the choice where all 52 weak classifiers were included.

$\vec{w} = [0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0]$

This would represent the choice where weak classifiers 2, 4, 14, 33, 39 and 45 were included, and all the others were turned off.

For any item $x_s$ in our training set, each choice for $\vec{w}$ will produce a prediction for what the label should be. We can write this prediction as ${\cal F}(x_s) = sign \left[ \sum_{j=1}^D w_j F_j(x_s) \right]$ . Since all of our training data comes with labels $y_s$, we can compare this prediction to the actual label. The way we will do this is that if the prediction is correct, we want the function that compares them to return zero, and if the prediction is wrong, we want it to return 1. If we then sum over all the elements in the training data, we get

$L_0(\vec{w}) = {{1}\over{4}} \sum_{s=1}^S \left( sign \left[ \sum_{j=1}^D w_j F_j(x_s)\right] -y_s \right)^2$

The function $L_0(\vec{w})$  is called the zero-one loss and simply counts the number of errors that a particular choice for $\vec{w}$ made – if it got every single member of the training set right,  $L_0(\vec{w})$ will be zero. If it got them all wrong, $L_0(\vec{w})$ will be equal to S, the total number of elements in the training set.
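As a sketch, the zero-one loss can be computed directly. The names `F` and `data` here are illustrative, and we take sign(0) = +1 for definiteness:

```python
def sign(v):
    # convention: sign(0) = +1
    return 1 if v >= 0 else -1

def zero_one_loss(w, F, data):
    """Count mislabeled items; F is a list of weak-classifier functions,
    data is a list of (x_s, y_s) pairs with y_s in {-1, +1}."""
    return sum(
        0.25 * (sign(sum(w_j * F_j(x_s) for w_j, F_j in zip(w, F))) - y_s) ** 2
        for x_s, y_s in data
    )

# Two toy weak classifiers on numbers: "is x positive?" and "always +1".
F = [lambda x: 1 if x > 0 else -1, lambda x: 1]
data = [(2, 1), (-3, -1), (5, 1)]
print(zero_one_loss([1, 0], F, data))  # 0.0: the first classifier alone is perfect here
print(zero_one_loss([0, 1], F, data))  # 1.0: "always +1" mislabels the one negative item
```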

In the QBC algorithm, it will be necessary to only include terms in the loss function that are either linear or quadratic in the $\vec{w}$  variables, because that’s the format that D-Wave hardware natively accepts. Because the $L_0(\vec{w})$ function has a ‘sign’ function in it, we won’t be able to use this exact functional form. Instead, we will use a related form called the quadratic loss.

The quadratic loss has the form

$L_1(\vec{w}) = \sum_{s=1}^S \left[ {{1}\over{D}} \sum_{j=1}^D w_j F_j(x_s) - y_s \right]^2$

Comparing this to the previous loss function, we see that the main difference is that we have removed the ‘sign’ function and replaced it with a normalized sum.
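A minimal sketch of the quadratic loss, with the same illustrative names as before:

```python
def quadratic_loss(w, F, data):
    """L1 from the text: the sign(...) is replaced by the normalized
    sum (1/D) * sum_j w_j F_j(x)."""
    D = len(w)
    return sum(
        (sum(w_j * F_j(x_s) for w_j, F_j in zip(w, F)) / D - y_s) ** 2
        for x_s, y_s in data
    )

F = [lambda x: 1 if x > 0 else -1, lambda x: 1]
data = [(2, 1), (-3, -1)]
print(quadratic_loss([1, 1], F, data))  # (1 - 1)^2 + (0 - (-1))^2 = 1.0
```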

Having to convert the loss function from ‘the one we really want’ to ‘one the hardware can handle’ is not ideal, and is one of the drawbacks of the QBC algorithm. However it turns out that the quadratic loss performs well for building good classifiers in many circumstances. In later modules we’ll introduce procedures for expanding the kinds of optimization problems we can run in hardware, but for now we’ll just use the quadratic loss.

Regularization

When building a classifier, it is desirable to have as few weak classifiers included as possible – you want your strong classifier to be as simple as you can make it, while retaining some desired level of performance.

Overly complicated strong classifiers will tend to perform better on your training set, but more poorly on your test set (and in the real world, on examples they haven’t seen yet). This happens because of a phenomenon known as over-fitting: a strong classifier formed from a large number of weak classifiers can adjust itself to the idiosyncrasies of your training set, and not generalize to as-yet-unseen examples as well as a simpler model would.

You can think of over-fitting as being analogous to “memorization” of the training set features, without “understanding” what it is about these features that is behind the classification. A student who crams for an exam at the last moment by memorizing a large number of specific facts may do quite well on the exam, because they have the specific facts they need in that situation memorized, but will have trouble applying the underlying concepts to other situations.

Here we will use what is known as zero-norm regularization to fight against over-fitting. This penalizes the inclusion of many weak classifiers.

The regularization term we will use is $R(\vec{w}) = \lambda \sum_{j=1}^D w_j$, where $\lambda$ is an as yet unspecified real number.

Ranking Performance: The Optimization Objective Function

Now that we have a loss function and a regularization term, we’re ready to define a procedure for ranking the performance of combinations of weak classifiers. Here we’ll try to introduce the basic concepts. We’ll formalize the procedure in a later post.

In order to find the best possible set of weak classifiers to form our strong classifier, we construct a function that is the sum of the loss function and the regularization term:

$E(\vec{w}) = L_1(\vec{w}) + R(\vec{w}) = \sum_{s=1}^S \left[ {{1}\over{D}} \sum_{j=1}^D w_j F_j(x_s) - y_s \right]^2 + \lambda \sum_{j=1}^D w_j$

We’ll refer to $E(\vec{w})$ as the optimization objective function. For each choice of $\vec{w}$,  $E(\vec{w})$ returns a number. The first term in  $E(\vec{w})$ — the loss function term — returns numbers that are lower the closer $\vec{w}$ gets to correctly labeling all elements of the training set. The second term in  $E(\vec{w})$ — the regularization term — just counts the number of ones in $\vec{w}$ , multiplied by some number $\lambda$.

Now let’s see how we can use $E(\vec{w})$ to do something useful. First, let’s find the settings of $\vec{w}$ that return the lowest possible value of $E(\vec{w})$ for some particular choice of $\lambda$ (say $\lambda=0$). We write this as

$\vec{w}^* = \arg \min_{\vec{w}} E(\vec{w}, \lambda = 0)$

The ‘arg min’ notation simply means ‘return the value of $\vec{w}$  that minimizes $E(\vec{w}, \lambda = 0)$’, and $\vec{w}^*$ is that value.

Alright so what is $\vec{w}^*$ ? It is just the list of weak classifiers that, when considered as a group, ‘perform best’ (meaning minimize the optimization objective function) for the choice of $\lambda$ we made. In the case where we set $\lambda=0$, what ‘performs best’ means is that the set of weak classifiers given by $\vec{w}^*$ has the lowest possible number of mismatches between predicted labels and actual labels for our training data.

We now have a possible candidate for our final strong classifier. Let’s call it $\vec{w}^* (\lambda = 0)$ to remember that it’s the result we found when using $\lambda = 0$.

If we repeat this process for many different values of $\lambda$ , the minimizer $\vec{w}^* (\lambda )$ will change. If we look at the limit where $\lambda$ is very large, the cost of including even a single weak classifier is prohibitive and $\vec{w}^*$ will become the vector of all zeroes.

So for every value we set $\lambda$ to, we get a candidate suggestion for a strong classifier. How can we select which of these might work best in the real world? The solutions returned for $\lambda \sim 0$ don’t take advantage of regularization, so you might expect these would suffer from over-fitting – the vector $\vec{w}^* (\lambda )$ would have too many ones in it (i.e. too many weak classifiers). The solutions returned for very large $\lambda$ will probably be too sparse to work well. Somewhere in between these will be the best choice.

In order to find what this best choice is, we now turn to our validation set. Recall that we partitioned our original data set into three groups – training, validation and test – and so far we’ve just used the training set.

For every value of $\lambda$ we tried, we got a solution $\vec{w}^* (\lambda )$ . We now rank these using the following procedure. For each solution $\vec{w}^* (\lambda )$  we compute the validation error

$E_v(\lambda) = {{1}\over{4}} \sum_{v=1}^V \left[ sign \left[ \sum_{j=1}^D w_j^*(\lambda) F_j(x_v) \right] - y_v \right]^2$

The validation error simply counts the number of elements of the validation set a strong classifier candidate mislabels. We can then rank the vectors $\vec{w}^* (\lambda )$ from lowest validation error (i.e. perform the best on the validation set) to highest validation error (perform the worst on the validation set).

We’d expect that $\vec{w}^* (\lambda )$ where $\lambda$ is either too high or too low would give higher validation errors, and somewhere in the middle is the lowest one. That’s the vector we want to output.

Once we have the value of  $\vec{w}^* (\lambda)$ that produces the smallest validation error, we’re done! That value encodes the final strong classifier that is the output of the QBC algorithm. It is the best possible strong classifier we could have made, given our choices for training and validation data and our dictionary of weak classifiers.
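The whole sweep over $\lambda$ can be sketched end to end. This is a toy sketch: brute-force enumeration stands in for the hardware, and all names and data below are illustrative.

```python
from itertools import product

def sign(v):
    return 1 if v >= 0 else -1

def select_classifier(F, train, valid, lambdas):
    """For each lambda, minimize E(w) = L1(w) + lambda * sum(w) over all binary
    vectors w (brute force stands in for the hardware), then return the
    (validation_error, lambda, w_star) triple with the lowest validation error."""
    D = len(F)

    def E(w, lam):
        loss = sum((sum(wj * Fj(x) for wj, Fj in zip(w, F)) / D - y) ** 2
                   for x, y in train)
        return loss + lam * sum(w)

    def val_err(w):
        return sum(0.25 * (sign(sum(wj * Fj(x) for wj, Fj in zip(w, F))) - y) ** 2
                   for x, y in valid)

    candidates = []
    for lam in lambdas:
        w_star = min(product([0, 1], repeat=D), key=lambda w: E(w, lam))
        candidates.append((val_err(w_star), lam, w_star))
    return min(candidates)

F = [lambda x: 1 if x > 0 else -1, lambda x: 1]
train = [(1, 1), (-1, -1)]
valid = [(2, 1), (-2, -1)]
print(select_classifier(F, train, valid, [0.0, 10.0]))
```

For these toy inputs the $\lambda = 0$ solution keeps only the useful weak classifier and makes no validation errors, so it wins the ranking.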

Notation: Massaging the Optimization Objective Function

In what follows it will be useful to have a couple of different ways of writing down our optimization objective function. In particular when we introduce the hardware, we usually write the form of the optimization problem in a different way that’s closer to the way we specify the machine language parameters of a chip.

The form we introduced in the previous section was

$E(\vec{w}) = \sum_{s=1}^S \left[ {{1}\over{D}} \sum_{j=1}^D w_j F_j(x_s) - y_s \right]^2 + \lambda \sum_{j=1}^D w_j$

One thing we should do here is expand out the squared term in brackets and throw away all of the terms that aren’t functions of $\vec{w}$ (these constants don’t affect the returned optimal value $\vec{w}^* (\lambda)$  so we can ignore them). When we do this we can write

$E(\vec{w}) = \sum_{j=1}^D Q_{jj} w_j + \sum_{i<j} Q_{ij} w_i w_j$

where $Q_{jj} = {{D \lambda}\over{2S}} + {{1}\over{2D}} -{{1}\over{S}} \sum_{s=1}^S y_s F_j(x_s)$ and $Q_{ij} = {{1}\over{DS}} \sum_{s=1}^S F_i(x_s) F_j(x_s)$.  Note that we multiplied all terms by a constant scaling factor  ${{D}\over{2S}}$ just so that it’s easier to keep track of the ranges of the terms in the objective function.

We are going to refer to this way of writing the optimization objective function as the QUBO representation. QUBO stands for Quadratic Unconstrained Binary Optimization.

You should convince yourself that multiplying out the original optimization objective function and collecting terms gives the expression above.
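One way to convince yourself is numerically: build Q from random ±1-valued weak classifier outputs (the text assumes $F_j(x_s) \in \{-1,+1\}$) and check that the QUBO form agrees with ${{D}\over{2S}}$ times the original objective, up to a constant that doesn’t depend on $\vec{w}$. The data here is made up purely for the check.

```python
import random
from itertools import product

# Numerical check (with made-up data) that the QUBO coefficients from the text
# reproduce D/(2S) times the original objective, up to a w-independent constant.
random.seed(0)
D, S, lam = 4, 6, 0.3
F = [[random.choice([-1, 1]) for _ in range(S)] for _ in range(D)]  # F[j][s] = F_j(x_s)
y = [random.choice([-1, 1]) for _ in range(S)]

Q_diag = [lam * D / (2 * S) + 1 / (2 * D) - sum(y[s] * F[j][s] for s in range(S)) / S
          for j in range(D)]
Q_off = {(i, j): sum(F[i][s] * F[j][s] for s in range(S)) / (D * S)
         for i in range(D) for j in range(i + 1, D)}

def E_orig(w):
    return sum((sum(w[j] * F[j][s] for j in range(D)) / D - y[s]) ** 2
               for s in range(S)) + lam * sum(w)

def E_qubo(w):
    return sum(Q_diag[j] * w[j] for j in range(D)) \
         + sum(q * w[i] * w[j] for (i, j), q in Q_off.items())

# The difference should be the same constant (D/2, from the dropped y_s^2 terms)
# for every one of the 2^D binary vectors w:
diffs = [D / (2 * S) * E_orig(w) - E_qubo(w) for w in product([0, 1], repeat=D)]
assert max(diffs) - min(diffs) < 1e-9
```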

For the purposes of coding up the QBC algorithm, the QUBO representation is all we will need. However as we’ll see in the next post, a slightly different way of looking at the problem is more natural for understanding what the hardware is doing.

If we make a change of variables from 0/1 binary variables $\vec{w}$  to -1/+1 ‘spin’ variables $\vec{z}$  by substituting $w_j = {{1+z_j}\over{2}}$  into our QUBO representation, we can write our optimization objective function as

$E(\vec{z}) = \sum_{j=1}^D Q_{jj} \left[ {{1+z_j}\over{2}} \right] + \sum_{i<j} Q_{ij} \left[ {{1+z_i}\over{2}} \right] \left[ {{1+z_j}\over{2}} \right]$

If we expand everything out, collect terms and throw away everything that doesn’t depend on the new optimization variables, we get

$E(\vec{z}) = \sum_{j=1}^D h_j z_j + \sum_{i<j} J_{ij} z_i z_j$

where $h_j = {{Q_{jj}}\over{2}} + \sum_{i \neq j} {{Q_{ij}}\over{4}}$ (the sum runs over the couplers touching variable $j$) and $J_{ij} = {{Q_{ij}}\over{4}}$. We’ll refer to this version as the Ising representation, and we’ll see why this is a useful way to look at the problem in the next post.
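The change of variables is easy to sanity-check numerically with a small made-up Q matrix: after substituting $w_j = (1+z_j)/2$, the QUBO and Ising energies should differ only by a constant.

```python
from itertools import product

# Small made-up Q, checking the 0/1 -> -1/+1 change of variables from the text.
D = 3
Q_diag = [0.5, -1.0, 0.25]
Q_off = {(0, 1): 0.4, (0, 2): -0.6, (1, 2): 0.2}

def E_qubo(w):
    return sum(Q_diag[j] * w[j] for j in range(D)) \
         + sum(q * w[i] * w[j] for (i, j), q in Q_off.items())

# Ising parameters: h_j = Q_jj/2 plus Q_ij/4 for every coupler touching j
h = [Q_diag[j] / 2 + sum(q / 4 for (i, k), q in Q_off.items() if j in (i, k))
     for j in range(D)]
J = {pair: q / 4 for pair, q in Q_off.items()}

def E_ising(z):
    return sum(h[j] * z[j] for j in range(D)) \
         + sum(Jij * z[i] * z[j] for (i, j), Jij in J.items())

# Substituting w_j = (1 + z_j)/2, the two forms differ only by a constant:
diffs = [E_qubo(tuple((1 + zj) // 2 for zj in z)) - E_ising(z)
         for z in product([-1, 1], repeat=D)]
assert max(diffs) - min(diffs) < 1e-12
```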

# Learning to program the D-Wave One: Introduction to binary classification

Imagine you are given the task of creating code that will input some (potentially very complex) object, and automatically label it with one of two possible labels. Here are some examples:

• Given an image, is there an apple in the image or not?
• Given a chest x-ray, is there a tumor in the image or not?
• Is a piece of code malware or not?
• Given a movie review, is it positive or negative?
• Given only a person’s first name, is the person more likely to be male or female?

How can we go about building the software that will apply these labels? There are many ways that you could try to do this. One of these, which we’ll focus on here, is a basic machine learning approach called supervised binary classification.

This approach requires you to prepare two (and only two!) things before you can implement a classifier. These are labeled data and a set of weak classifiers. In this post I’ll focus on what labeled data is, how to get it ready for our procedure, and places where you can find very cool curated data sets.

Labeled Data

If we want a machine to be able to learn to recognize something, one way to proceed is to present the machine with large numbers of objects that have been labeled (usually by humans) to either include the thing we are looking for or not, together with a prescription allowing the machine to learn what features in the object correlate with these labels.

For example, if we want to build code to detect whether an apple is in an image, we could generate a large set of images, all with apples in them, and label these images with the word “apple”. We could also generate a large set of images, none of which had apples in them, and label each of these “no apple”. A good machine learning algorithm might then detect a correlation between having red in an image, and having an “apple” label. The algorithm could learn that images that contain red things are more likely to contain apples, based on the examples we’ve shown it.

This type of approach, which depends on having large numbers of properly labeled examples, is called supervised machine learning.

The notation we will use is that the potentially very complex object, to which the label is assigned, is denoted x. x can be any representation of the object that you feel makes sense, given the type of object you are dealing with. Sometimes it makes sense to think of x as a vector of real numbers. For example, if you are dealing with images, x could be the values of the pixels, or the values of color histograms, or the coefficients of some type of wavelet transform. If you are dealing with natural language, as we will be in the example we’ll build here, x is often a string.

The label that has been assigned to that object we will denote y . We will use the convention that $y = \pm 1$, with these two different labels referring to the two different possibilities we are looking for our binary classifier to select.

Every item in our labeled data will be a pair of the form (x, y). The total number of items we have access to we’ll call N. If we want to build a system that learns to classify the most likely gender of a person’s first name, labeled data could look like this:

$(x_1,y_1)$ = (‘Mohammad’,-1)

$(x_2,y_2)$ = (‘Suzanne’,+1)

$(x_3,y_3)$ = (‘Geordie’,-1)

.

.

.

$(x_N,y_N)$ = (‘Erin’,+1)

In this example the x variables are strings holding a name, and the y values hold the most likely gender for that name, encoded so that +1 means “female” and -1 means “male”.

Note that while this example looks fairly simple, in general you can put anything you like in the x slot. If you were building a binary classifier that labeled a novel as to whether it was written by Dean Koontz or not, you might put the entire text of the novel in this slot. If you were interested in labeling images, x could be the pixel values in an image.

Once we have access to this labeled data, we need to perform a set of simple operations on it in order to proceed. These operations separate the available data into three separate groups, which are called the training set, validation set and test set.

The training set is the set of data that you will use to build your classifier; it is the subset of your data that the algorithm uses to learn the difference between objects labeled +1 and objects labeled -1.

The validation set is the set of data that you will use to gauge the performance of a potential classifier, while your training algorithm is running.

The test set is the set of data that you will use to test how well the best classifier you found could be expected to perform “in real life”, on examples it hasn’t seen yet.

There are procedures that have been developed for how to best slice your available labeled data into these categories. In the example we’ll implement here, we won’t use these (but if you’re coming at this with a machine learning background, you should!). If you are going to build industrial strength classifiers you will need to take some care in how you segment your available data, and you will encounter issues that need some thought to resolve. If you would like me to post anything about these let me know in the comments section.

Because we’re focusing on showing the basic principles here, we’ll just use a very simple procedure for segmenting our data. Here is how it works:

The first two steps ensure that the data we are keeping has equal numbers of +1 and -1 examples. Note that for this to work exactly as laid out K has to be even and R has to be divisible by four. Segmenting everything exactly like this isn’t necessary. You are free to choose the sizes of your test, training and validation sets to be anything you like.

In our implementation, K was 2,942 (this was the number of male names in the NLTK “names” corpus, which is considerably smaller than the number of female names), and R was chosen to be 100. For these choices, the training set contains S = 2,892 elements (half labeled +1 and half labeled -1), the validation set contains V = 2,892 elements (half labeled +1 and half labeled -1), and the test set contains R = 100 elements (half labeled +1 and half labeled -1).
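A hedged sketch of a balanced split in the spirit of the text (the linked Algorithm 1 may differ in its details; the function name and its arguments are illustrative):

```python
import random

# Sketch of a balanced train/validation/test split: keep equal numbers of +1
# and -1 examples, carve off R test items (half per label), and divide the
# rest evenly into training and validation sets.
def segment(data, R, seed=0):
    rng = random.Random(seed)
    pos = [d for d in data if d[1] == +1]
    neg = [d for d in data if d[1] == -1]
    K = min(len(pos), len(neg))       # keep equal numbers of +1 and -1 examples
    rng.shuffle(pos)
    rng.shuffle(neg)
    pos, neg = pos[:K], neg[:K]
    # R test items, half from each label (assumes R divisible by 4, K even):
    test = pos[:R // 2] + neg[:R // 2]
    rest_pos, rest_neg = pos[R // 2:], neg[R // 2:]
    half = len(rest_pos) // 2
    train = rest_pos[:half] + rest_neg[:half]
    valid = rest_pos[half:] + rest_neg[half:]
    return train, valid, test

# With 20 positive and 30 negative items and R = 4, this gives an 18/18/4 split.
train, valid, test = segment([(i, +1) for i in range(20)]
                             + [(i, -1) for i in range(30)], R=4)
print(len(train), len(valid), len(test))  # 18 18 4
```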

Here are links to a Python function that implements Algorithm 1 (note: I had to change the file extension to .doc as wordpress doesn’t like .py. Change the extension to .py)  as well as the training_set, validation_set and test_set (all in cPickled form — note again change extension to .txt, wordpress doesn’t like .txt either?) we’re going to use to build our classifier.

Question: anyone have a better way of sharing source code on wordpress?

Try it! At this point it makes sense to go over the source code for Algorithm 1 linked to above, and understand what it’s doing. If your Python is rusty, or you’ve never used NLTK, this is a good place to get your mojo going. Ask me questions if something doesn’t work!!!