Sparse coding on D-Wave hardware: things that don’t work

Ice, ice baby.

Ice, ice baby.

For Christmas this year, my dad bought me a book called Endurance: Shackleton’s Incredible Voyage, by Alfred Lansing.

It is a true story about folks who survive incredible hardship for a long time. You should read it.

Shackleton’s family motto was Fortitudine Vincimus — “by endurance we conquer”. I like this a lot.

On April 22nd, we celebrate the 14th anniversary of the incorporation of D-Wave. Over these past 14 years, nearly everything we’ve tried hasn’t worked. While we haven’t had to eat penguin (yet), and to my knowledge no amputations have been necessary, it hasn’t been a walk in the park. The first ten things you think of always turn out to be dead ends or won’t work for some reason or other.

Here I’m going to share an example of this with the sparse coding problem by describing two things we tried that didn’t work, and why.

Where we got to last time

In the last post, we boiled down the hardness of L0-norm sparse coding to the solution of a large number of QUBOs of the form

Find \vec{w} that minimizes

G(\vec{w}; \lambda) = \sum_{j=1}^{K} w_j [ \lambda + \vec{d}_j \cdot (\vec{d}_j -2 \vec{z}) ] + 2 \sum_{j \leq m}^K w_j w_m \vec{d}_j \cdot \vec{d}_m

I then showed that using this form has advantages (at least for getting a maximally sparse encoding of MNIST) over the more typical L1-norm version of sparse coding.

I also mentioned that we used a variant of tabu search to solve these QUBOs. Here I’m going to outline two strategies we tried to use the hardware to beat tabu that ended up not working.

These QUBOs are fully connected, and the hardware isn’t

The terms in the QUBO that connect variables j and m are proportional to the dot product of the j^{th} and m^{th} dictionary atoms \vec{d}_j and \vec{d}_m. Because we haven’t added any restrictions on what these atoms need to look like, these dot products can all be non-zero (the dictionary atoms don’t need to be, and in general won’t be, orthogonal). This means that the problems generated by the procedure are all fully connected — each variable is influenced by every other variable.

Unfortunately, when you build a physical quantum computing chip, this full connectivity can’t be achieved. The chip you get to work with connects any given variable with only a small number of other variables.

There are two ways we know of to get around the mismatch of the connectivity of a problem we want to solve, and the connectivity of the hardware. The first is called embedding, and the second is by using the hardware to perform a type of large neighborhood local search as a component of a hybrid algorithm we call BlackBox.

Solving problems by embedding

In a quantum computer, qubits are physically connected to only some of the other qubits. In the most recent spin of our design, each qubit is connected to at most 6 other qubits in a specific pattern which we call a Chimera graph. In our first product chip, Rainier, there were 128 qubits. In the current processor, Vesuvius, there are 512.

Chimera graphs are a way to use a regular repeating pattern to tile out a processor. In Rainier, the processor graph was a four by four tiling of an eight qubit unit cell. For Vesuvius, the same unit cell was used, but with an eight by eight tiling.

For a detailed overview of the rationale behind embedding, and how it works in practice for Chimera graphs, see here and here, which discuss embedding into the 128-qubit Rainier graph (Vesuvius is the same, just more qubits).

The short version is that an embedding is a map from the variables of the problem you wish to solve to the physical qubits in a processor, where the map can be one-to-many (each variable can be mapped to many physical qubits). To preserve the problem structure we strongly ‘lock together’ qubits corresponding to the same variable.

ramsey_number_embeddingIn the case of fully connected QUBOs like the ones we have here, it is known that you can always embed a fully connected graph with K vertices into a Chimera graph with (K-1)^2/2 physical qubits — Rainier can embed a fully connected 17 variable graph, while Vesuvius can embed a fully connected 33 variable graph. Shown to the right is an embedding from this paper into Rainier, for solving a problem that computes Ramsey numbers. The processor graph where qubits colored the same represent the same computational variable.

So one way we could use Vesuvius to solve the sparse coding QUBOs is to restrict K to be 33 or less and embed these problems. However this is unsatisfactory for two (related) reasons. The first is that 33 dictionary atoms isn’t enough for what we ultimately want to do (sparse coding on big data sets). The second is that QUBOs generated by the procedure I’ve described are really easy for tabu search at that scale. For problems this small, tabu gives excellent performance with a per problem timeout of about 10 milliseconds (about the same as the runtime for a single problem on Vesuvius), and since it can be run in the cloud, we can take advantage of massive parallellism as well. So even though on a problem by problem basis, Vesuvius is competitive at this scale, when you gang up say 1,000 cores against it, Vesuvius loses (because there aren’t a thousand of them available… yet🙂 ).

So this option, while we can do it, is out. At the stage we’re at now this approach can’t compete with cloud-enabled tabu. Maybe when we have a lot more qubits.

Solving sparse coding QUBOs using BlackBox

BlackBox is an algorithm developed at D-Wave. Here is a high level introduction to how it works. It is designed to solve problems where all we’re given is a black box that converts possible answers to binary optimization problems into real numbers denoting how good those possible answers are. For example, the configuration of an airplane wing could be specified as a bit string, and to know how ‘good’ that configuration was, we might need to actually construct that example and put it in a wind tunnel and measure it. Or maybe just doing a large-scale supercomputer simulation is enough. But the relationship between the settings of the binary variables and the quality of the answer in problems like this is not easily specified in a closed form, like we were able to do with the sparse coding QUBOs.

BlackBox is based on tabu search, but uses the hardware to generate a model of the objective function around each search point that expands possibilities for next moves beyond single bit flips. This modelling and sampling from hardware at each tabu step increases the time per step, but decreases the number of steps required to reach some target value of the objective function. As the cost of evaluating the objective function goes up, the gain in making fewer ‘steps’ by making better moves at each tabu step goes up. However if the objective function can be very quickly evaluated, tabu generally beats BlackBox because it can make many more guesses per unit time because of the additional cost of the BlackBox modeling and hardware sampling step.

BlackBox can be applied to arbitrary sized fully connected QUBOs, and because of this is better than embedding because we lose the restriction to small numbers of dictionary atoms. With BlackBox we can try any size problem and see how it does.

We did this, and unfortunately BlackBox on Vesuvius is not competitive with cloud-enabled tabu search for any of the problem sizes we tried (which were, admittedly, still pretty small — up to 50 variables). I suspect that this will continue to hold, no matter how large these problems get, for the following reasons:

  1. The inherently parallel nature of the sparse coding problem (S independent QUBOs) means that we will always be up against multiple cores vs. a small number of Vesuvius processors. This factor can be significant — for a large problem with millions of data objects, this factor can easily be in the thousands or tens of thousands.
  2. BlackBox is designed for objective functions that are really black boxes, so that there is no obvious way to attack the structure of the problem directly, and where it is very expensive to evaluate the objective function. This is not the case for these problems — they are QUBOs and this means that attacks can be made directly based on this known fact. For these problems, the current version of BlackBox, while it can certainly be used, is not in its sweet spot, and wouldn’t be expected to be competitive with tabu in the cloud.

And this is exactly what we find — BlackBox on Vesuvius is not competitive with tabu on the cloud for any of the problem sizes we tried. Note that there is a small caveat here — it is possible (although I think unlikely) that for very large numbers of atoms (say low thousands) this could change, and BlackBox could start winning. However for both of the reasons listed above I would bet against this.

What to do, what to do

We tried both obvious tactics for using our gear to solve these problems, and both lost to a superior classical approach. So do we give up and go home? Of course not!

We shall go on to the end… we shall never surrender!!!

We just need to do some mental gymnastics here and be creative.

In both of the approaches above, we tried to shoehorn the problem our application generates into the hardware. Neither solution was effective.

So let’s look at this from a different perspective. Is it possible to restrict the problems generated by sparse coding so that they exactly fit in hardware — so that we require the problems generated to exactly match the hardware graph? If we can achieve this, we may be able to beat the classical competition, as we know that Vesuvius is many orders of magnitude faster than anything that exists on earth for the native problems it’s solving.

7 thoughts on “Sparse coding on D-Wave hardware: things that don’t work

  1. Pingback: Sparse coding on D-Wave hardware: finding an optimal structured dictionary | Hack The Multiverse

  2. I have been curious as to why the current (and past models) of the chip have so many “dead” qubits, how they become “die” (engineering error, intentional?), and how these affect embeddings. Is there a plan to have fewer dead qubits in the future?

    • Hi Henri! Each device in the chip needs to be within an ‘acceptable range’ of a variety of parameters as the devices in the chip need to be very close to identical to each other. As each is a manufactured object, there will be variations in these parameters that need to be tuned away. The devices that don;t make the final graph are too far away from the target values of these parameters to be made nominally identical. As an example of this, take the mutual inductance between the qubit and the annealing line. The value of this parameter depends on a variety of things including the thicknesses of layers of insulator and metal around the area where the qubit interacts with the annealing line. When we measure these values you get a distribution M1, M2, …, MN. We then tune some in-situ knobs to try to get them all to be some target value +- a small allowed deviation. If a qubit can’t be tuned into this range it gets dropped. There are about a dozen of these tests that all devices have to pass to make it into the final graph.

  3. Geordi,

    The 128-qubit Rainier was able to prove R(3,3) = 6 and you note that Rainier can embed
    a fully connected 17 variable graph. Given that R(5,5) has not yet been determined with
    any computer, but is known to be between 43 and 49, how many qubits are needed
    to determine R(5,5) with a DWave system? Is it about 45*45*2 = 4,050 or some other value?
    If you had a DWave computer with enough qubits, is there some barrier to finding R(5,5)?
    Like the fact that all qubits aren’t fully connected?


Leave a Reply

Please log in using one of these methods to post your comment: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s