Sparse coding on D-Wave hardware: finding weights is a QUBO

In my last post, I set up an optimization problem that we need to solve to do a particularly interesting kind of sparse coding. Solving this problem requires several steps. Some of these are easy and some are not. In this post I’ll tease out the core hard bit, and show that it’s of the kind that D-Wave hardware solves, which goes by several different names, including Quadratic Unconstrained Binary Optimization (QUBO), the Ising model, and weighted max-2-sat.

Warning: mathematics ahead is OT Level VII. It does get more gnarly than this but not by much.

.

Here is the optimization problem we came up with in the previous post.

L0-norm sparse coding

Given

1. A set of $S$ data objects $\vec{z}_s$, where each $\vec{z}_s$ is a real valued vector with $N$ components;
2. An $N x K$ real valued matrix $\hat{D}$, where $K$ is the number of dictionary atoms we choose, and we define its $k^{th}$ column to be the vector $\vec{d}_k$;
3. A $K x S$ binary valued matrix $\hat{W}$;
4. And a real number $\lambda$, which is called the regularization parameter,

Find $\hat{W}$ and $\hat{D}$ that minimize

$G(\hat{W}, \hat{D} ; \lambda) = \sum_{s=1}^S || \vec{z}_{s} - \sum_{k=1}^{K} w_{ks} \vec{d}_k ||^2 + \lambda \sum_{s=1}^S \sum_{k=1}^{K} w_{ks}$

L1-norm sparse coding

The L0 version of sparse coding differs from the way most people do sparse coding. Standard sparse coding assumes the weights $\hat{W}$ are real numbers, and the optimization problem is:

Find $\hat{W}$ and $\hat{D}$ that minimize

$G(\hat{W}, \hat{D} ; \lambda) = \sum_{s=1}^S || \vec{z}_{s} - \sum_{k=1}^{K} w_{ks} \vec{d}_k ||^2 + \lambda \sum_{s=1}^S \sum_{k=1}^{K} |w_{ks}|$

This looks very similar. The only difference is that all the entries in $\hat{W}$ are real, and the regularization term is now a sum over the absolute values of those entries. The real weight version is called L1-norm sparse coding, and the binary weight version is called L0-norm sparse coding.

You can think of the L0 version as only being able to either add or not add dictionary atoms to form a reconstruction, whereas the L1 version can add arbitrary “amounts” of the atoms.

Both the L0- and L1-norm sparse coding versions can be solved using the same procedure. Here it is.

Solving the sparse coding problem using block coordinate descent

1. First, we fix the weights $\hat{W}$ to some random initial configurations.
2. Assuming these fixed weights, we solve the optimization problem for the dictionary atoms $\hat{D}$.
3. Now we fix the dictionary atoms to these values, and find the optimal weights $\hat{W}$.

We then iterate steps 2 and 3 until $G$ converges to a minimum, which will be a global minimum for L1 and a local minimum for L0. Note that this basic procedure works for both versions. In step 1, the only difference is that the initial configurations have to be either real or binary. In step 2, the optimization over real valued $\hat{D}$ is identical for both versions — the algorithm we use is the one described here for finding the dictionary atoms.This algorithm is fast. It’s in step 3 where we find a very big difference in the difficulty of finding the optimal weights, given a dictionary.

Finding optimal weights, given a fixed dictionary

In the case where the weights are real, and the regularization is of the L1 form (sum over absolute values), we use the Feature Sign Search (FSS) algorithm described here. This algorithm is quite fast, and it has to be, because there are a total of $S$ of these that need to be solved to complete the step, and $S$ can be very large. An interesting observation is that all of these optimization problems are independent, and can therefore this can be efficiently parallellized. In order to perform this parallellization, we use the PiCloud python libraries, which allows us to run hundreds or thousands of parallel jobs to perform the optimization over the weights. As a rough estimate, for the optimization problems generated by MNIST, each optimization using FSS takes about 30 milliseconds, and there are 60,000 of these per iteration of the block descent procedure. If we run serially this is about 30 minutes per iteration. If we use 100 cores, we can send 600 jobs to each core, and get about 100x speed-up, taking the time down to about 20 seconds.

As an interesting aside, we find that our own Python implementation of FSS is about the same in terms of performance as the original MATLAB code provided by Honglak Lee. This was a little surprising as the core computations run in highly optimized compiled code inside MATLAB. This is evidence that the routines within numpy are competitive with MATLAB’s versions for the core FSS computations.

Now if we shift over to the L0 version, we have a different kind of optimization problem to solve. When the weights are bits, the problem we need to solve is NP-hard, and to do anything practical we’re going to need to use a heuristic solver. As in the case of the L1 version, we also get $S$ independent problems. So let’s just focus on one of these. If the dictionary is fixed, the problem we need to solve is:

Find $\vec{w}$ that minimizes

$G(\vec{w}; \lambda) = || \vec{z} - \sum_{k=1}^{K} w_{k} \vec{d}_k ||^2 + \lambda \sum_{k=1}^{K} w_{k}$

Expanding this out, and dropping constant terms, this can be rewritten as

$G(\vec{w}; \lambda) = \sum_{k=1}^{K} w_k [ \lambda + \vec{d}_k \cdot (\vec{d}_k -2 \vec{z}) ] + 2 \sum_{j \leq m}^K w_j w_m \vec{d}_j \cdot \vec{d}_m$

This is a QUBO.

Solving these QUBOs

This is Qubert. QUBOs are different.

There are several potential solver options we could use here, including running in D-Wave hardware. D-Wave processors are designed to solve QUBOs, and can natively solve the types of optimization problem described here. There are lots of other options also. Of all the conventional options we looked at, the solver that worked best was based on the tabu search algorithm.

So what we’ve accomplished here is to reduce the key step in the L0-norm sparse coding procedure to a set of QUBOs. In the next post, I’ll show some preliminary results of the comparative performance of the L0 and L1 versions on MNIST, show dictionaries generated by both, and introduce a variant that makes more direct use of D-Wave hardware.

7 thoughts on “Sparse coding on D-Wave hardware: finding weights is a QUBO”

1. I think I’m going to need a D-Wave chip implanted in my brain to be able to comprehend this.

2. If you can program a system such that the minimum of the two variables is the lowest state of stable equilibrium, and then “shake” it with gradually reducing amplitude, it will settle in that state and you can read off the variables.

If you are presented with a beaker of sand and want to level it, you can either use a pair of tweezers and move each grain or do the job much quicker by shaking it with gradually reducing amplitude.

Look at http://www.alecharleyreeves.com and select “The Equilibrium Process”. Click on the link near the end and get to the actual lecture.

Is this actually how the D-Wave works?

• Hi John! Not exactly. What you’re describing is related to classical thermal annealing, where you attempt to piggyback on nature’s desire to drive complex systems into thermal equilibrium with their environments (the second law of thermodynamics). Our systems are based on the same premise but instead of annealing temperature they anneal a different parameter which is inherently quantum mechanical. You can think of this parameter as going from having all results of a computation held in superposition at the beginning of a computation, to the final result where only one result is returned. The analogous thing in thermal annealing is to go from high temperature (all answers equally likely) to zero temperature (the best answer has 100% probability if while decreasing the temperature you can always keep the system in thermal equilibrium).

• Many thanks for that. The original “equilibrium coder” (in the lecture) used tunnel diodes, although I think Josephson Junctions were considered, but it was a long time ago.

3. Hello Geordie. I was just wanting to make sure. If you ran this final QUBO for this sparse coding problem, you would have to run it S times it seems. It seems like this would affect the efficiency a lot because of the initial overhead for programming the new weights for each z_k. Is there any way around that, or is the overhead really not as much as I’m thinking it to be?

Another just-to-make-sure thing. For your final Q, I got the diagonals to look like (lambda – 2zd), and the off-diagonals are simply d_i d_j, and you dropped the z-squared term. Is this correct? Sometimes math is different in my head than in reality.

Big fan of sparse coding, big fan of you and D-Wave, and it’s really awesome that you guys are working on this. And of course, thanks for sharing.