In my last post, I set up an optimization problem that we need to solve to do a particularly interesting kind of sparse coding. Solving this problem requires several steps. Some of these are easy and some are not. In this post I’ll tease out the core hard bit, and show that it’s of the kind that D-Wave hardware solves, which goes by several different names, including Quadratic Unconstrained Binary Optimization (QUBO), the Ising model, and weighted max-2-sat.
Warning: mathematics ahead is OT Level VII. It does get more gnarly than this but not by much.
Here is the optimization problem we came up with in the previous post.
L0-norm sparse coding
- A set of data objects , where each is a real valued vector with components;
- An real valued matrix , where is the number of dictionary atoms we choose, and we define its column to be the vector ;
- A binary valued matrix ;
- And a real number , which is called the regularization parameter,
Find and that minimize
L1-norm sparse coding
The L0 version of sparse coding differs from the way most people do sparse coding. Standard sparse coding assumes the weights are real numbers, and the optimization problem is:
Find and that minimize
This looks very similar. The only difference is that all the entries in are real, and the regularization term is now a sum over the absolute values of those entries. The real weight version is called L1-norm sparse coding, and the binary weight version is called L0-norm sparse coding.
You can think of the L0 version as only being able to either add or not add dictionary atoms to form a reconstruction, whereas the L1 version can add arbitrary “amounts” of the atoms.
Both the L0- and L1-norm sparse coding versions can be solved using the same procedure. Here it is.
Solving the sparse coding problem using block coordinate descent
- First, we fix the weights to some random initial configurations.
- Assuming these fixed weights, we solve the optimization problem for the dictionary atoms .
- Now we fix the dictionary atoms to these values, and find the optimal weights .
We then iterate steps 2 and 3 until converges to a minimum, which will be a global minimum for L1 and a local minimum for L0. Note that this basic procedure works for both versions. In step 1, the only difference is that the initial configurations have to be either real or binary. In step 2, the optimization over real valued is identical for both versions — the algorithm we use is the one described here for finding the dictionary atoms.This algorithm is fast. It’s in step 3 where we find a very big difference in the difficulty of finding the optimal weights, given a dictionary.
Finding optimal weights, given a fixed dictionary
In the case where the weights are real, and the regularization is of the L1 form (sum over absolute values), we use the Feature Sign Search (FSS) algorithm described here. This algorithm is quite fast, and it has to be, because there are a total of of these that need to be solved to complete the step, and can be very large. An interesting observation is that all of these optimization problems are independent, and can therefore this can be efficiently parallellized. In order to perform this parallellization, we use the PiCloud python libraries, which allows us to run hundreds or thousands of parallel jobs to perform the optimization over the weights. As a rough estimate, for the optimization problems generated by MNIST, each optimization using FSS takes about 30 milliseconds, and there are 60,000 of these per iteration of the block descent procedure. If we run serially this is about 30 minutes per iteration. If we use 100 cores, we can send 600 jobs to each core, and get about 100x speed-up, taking the time down to about 20 seconds.
As an interesting aside, we find that our own Python implementation of FSS is about the same in terms of performance as the original MATLAB code provided by Honglak Lee. This was a little surprising as the core computations run in highly optimized compiled code inside MATLAB. This is evidence that the routines within numpy are competitive with MATLAB’s versions for the core FSS computations.
Now if we shift over to the L0 version, we have a different kind of optimization problem to solve. When the weights are bits, the problem we need to solve is NP-hard, and to do anything practical we’re going to need to use a heuristic solver. As in the case of the L1 version, we also get independent problems. So let’s just focus on one of these. If the dictionary is fixed, the problem we need to solve is:
Find that minimizes
Expanding this out, and dropping constant terms, this can be rewritten as
This is a QUBO.
Solving these QUBOs
There are several potential solver options we could use here, including running in D-Wave hardware. D-Wave processors are designed to solve QUBOs, and can natively solve the types of optimization problem described here. There are lots of other options also. Of all the conventional options we looked at, the solver that worked best was based on the tabu search algorithm.
So what we’ve accomplished here is to reduce the key step in the L0-norm sparse coding procedure to a set of QUBOs. In the next post, I’ll show some preliminary results of the comparative performance of the L0 and L1 versions on MNIST, show dictionaries generated by both, and introduce a variant that makes more direct use of D-Wave hardware.