# Sparse coding on D-Wave hardware: some results

Last week I described two variants of sparse coding. One, which is commonly used, attempts to find a dictionary of atoms which can be used to sparsely reconstruct data, where the reconstructions are linear combinations of these atoms multiplied by real numbers, and the regularization term has the L1-norm form. We called this L1-norm sparse coding. I described the way we solve the optimization problem underlying this method.

The second approach is generally not used, and I believe very little is known about it. The procedure is identical to the usual approach except for one small change. Instead of real numbers, we attempt to reconstruct the data using a linear combination of dictionary atoms where the weights are binary, and we use L0-norm regularization. We called this L0-norm sparse coding.

To solve the L0-norm sparse coding optimization problem, we use the same procedure — called block coordinate descent — that was successfully used for the L1 version. The difference is that unlike the L1 version, where the step where optimization over the weights can be done efficiently, in the L0 version that optimization is NP-hard. The problem separates into $S$ independent problems of finding the optimal weights for each of the $S$ objects in our data set. These problems are of the form

Find $\vec{w}$ that minimizes

$G(\vec{w}; \lambda) = \sum_{j=1}^{K} w_j [ \lambda + \vec{d}_j \cdot (\vec{d}_j -2 \vec{z}) ] + 2 \sum_{j \leq m}^K w_j w_m \vec{d}_j \cdot \vec{d}_m$

which is a QUBO. So to perform the optimization over the weights, we need to solve $S$ independent QUBOs at each iteration of the block descent. There are many ways to do this, including using D-Wave hardware. The results I’ll show below were obtained using a variant of tabu search.

Here I’m going to show some interesting results on some differences between these two approaches on MNIST.

Some initial results

Sparse coding attempts to minimize the reconstruction error, which is the total summed difference between the initial data — the ground truth — and the system’s reconstructions of this data, using only linear combinations of the dictionary atoms. It does this subject to a condition which penalizes including too many of these atoms in any given reconstruction — this is the regularization term.

As the strength of the regularization increases — which means $\lambda$ is made bigger — fewer and fewer atoms can be used to reconstruct each data object. Here we define the average sparsity to be the average number of dictionary atoms used in the reconstructions.

For many reasons, it is desirable to have as low an average sparsity as we can get away with. Now what ‘we can get away with’ is highly context dependent. What we decide this means depends on what we are ultimately attempting to do. One of the ways we can decide what this means is to use these ideas for lossy compression — we want to find a way to represent our data set with as few bits as we can, and are willing to accept some user-defined amount of degradation in our data to achieve this. To do this, we attempt to do the following:

Find the value of $\lambda$ that minimizes average sparsity, subject to the requirement that the reconstruction error be lower than a user supplied threshold.

One of the interesting initial results of comparing the L0 and L1-norm versions of sparse coding is that the L0 version gives much sparser representations for the same reconstruction error. Here is a plot of the total reconstruction error over $S = 60,000$ MNIST training images as a function of the average number of atoms per image used in these reconstructions, for both L1 and L0-norm sparse coding. In this experiment we used $K = 64$ dictionary atoms, and the data objects were representations of the raw MNIST images including the first 30 SVD modes (so each image was represented in its ground truth using 30 real numbers). Each data point corresponds to a different value of $\lambda$, where the points to the left have larger $\lambda$ (more sparse) with $\lambda$ decreasing to the right (less sparse).

This is a very cool result. For the same reconstruction error, the L0 version requires roughly one half the number of atoms. In addition, the L1 version’s weights are real numbers, which require several bits to specify (up to 64, but likely in practice 8 would be enough), whereas by construction the L0 version’s weights are binary. So if our objective is compression, L0 wins both from the perspective of number of atoms required and the number of bits per atom.

If we represent the L1 weights using 8 bits, and use half as many to get the same quality reconstruction, this means that for the figure above L0 achieves about 16x more compression than L1 for the same reconstruction quality.

What a dictionary looks like

Once the sparse coding procedure is complete, we can look at the dictionary atoms. Shown here are the dictionary atoms found during an L0 run with $\lambda = 0.035$, which gave an average sparsity of 2.25 atoms / image. In the figure, the 64 dictionary atoms are shown in the top part. The bottom part shows the ground truth for the first 48 images of the 60,000 total. The middle part shows the reconstructions, and which dictionary atoms were used in these reconstructions. If more than two atoms were used, the notation is >>#Y, where Y is the number of atoms.

Next, let’s look at a close-up showing how the reconstructions work for three of the images. The way to understand the image below is that the atoms (on the right) are added together, and what this addition gives is the reconstruction (second column from left), which we can compare to the ground truth (leftmost column). One technical detail is that prior to doing the sparse coding procedure, we subtracted off the “average” value of the pixels over the entire 60,000 image data set, which gets added back here. So the reconstructions are always ‘average image’ + sum over the atoms in the reconstruction. If you look at the middle one (the nine), the difference between atom #3 and the reconstruction of the nine is the average image.

Next, let’s look at a dictionary obtained from the L1 procedure, with $\lambda = 0.25$ chosen to give roughly the same average sparsity as above (here the average number of atoms / image was 2.16). You can see with your eye that the jump in reconstruction error (from about 4,000 for L0 to about 6,500 for L1 in the first graph above) makes for poor reconstructions. Note that the L1 procedure doesn’t always give poor reconstructions — they can be quite excellent. You just need more atoms on average to get there.

This is great, but where’s the hardware fit in?

There is some debate about whether L0-norm sparse coding actually buys you anything over the L1 version in practice. While anecdotal results such as the above really don’t do much to settle this issue, it is nonetheless encouraging to see that recasting this basic workhorse procedure in this way does seem to provide a significant benefit — at least for compressing MNIST.

Of course the reason we’ve been thinking so hard about this issue is that we want to find problems we can apply quantum computation to. Given that the basic L0-norm sparse coding procedure seems to provide intriguing benefits over the L1 version, the second order question is whether quantum computers can be used to solve the optimization problems underlying this version more effectively than you could otherwise do. This will be the subject of my next post.