# Progress on solving the constrained optimization problem for the dictionary

It’s a strange thing, but adding constraints to a problem often helps solve it. I’ve seen this happen many times. I think the reason is that if there are too many options, it’s very hard to focus on a very specific objective, and without this kind of specificity it’s very hard to make progress.

Interestingly, it seems as if restricting choices makes people happier also. You should watch this TED talk, it is fascinating.

Also you should watch this to fully appreciate the cosmic horror of freedom.

.

Constraints! Horrible, horrible constraints!

In my last post I introduced a problem that arises when we structure the dictionary so that the QUBOs sparse coding generates have some desired structure. In particular we want to constrain the QUBOs to have the specific connectivity of some given D-Wave processor, so that we don’t have to do any contortions to map problems into hardware. In general, this idea of building algorithms subject to them having to use the hardware in its native mode is a great idea. Endorsed by space ant colonies everywhere.

Unfortunately in this case, when we do this one of the other steps in the sparse coding procedure goes from easy to hard. Here is the new, hard problem that we need to solve.

Finding an optimal structured dictionary given fixed weights

Given

1. An $N x S$ real valued matrix $\hat{Z}$, where $S$ is the number of data objects, and we define the $s^{th}$ column to be the $s^{th}$ data object $\vec{z}_s$, where each $\vec{z}_s$ is a real valued vector with $N$ components, and the matrix elements of $\hat{Z}$ are $z_{ns}$;
2. An $N x K$ real valued matrix $\hat{D}$, where $K$ is the number of dictionary atoms we choose, and we define its $k^{th}$ column to be the vector $\vec{d}_k$, and the matrix elements of $\hat{D}$ are $d_{nk}$;
3. And a $K x S$ binary valued matrix $\hat{W}$ with matrix elements $w_{ks}$;

Find $\hat{D}$ that minimizes

$G^{*}(\hat{D}) = \sum_{s=1}^S || \vec{z}_{s} - \sum_{k=1}^{K} w_{ks} \vec{d}_k ||^2$

${=\sum_{s=1}^S \sum_{n=1}^N (z_{ns}-\sum_{k=1}^{K} w_{ks} d_{nk})^2}$

${=||\hat{Z}-\hat{D} \hat{W}||^2=Tr(\hat{A}^T \hat{A})}$

where $\hat{A}=\hat{Z}-\hat{D} \hat{W}$, subject to the constraints that $\vec{d}_j \cdot \vec{d}_m = 0$ for all pairs $j,m$ that are not connected in the quantum chip being used.

Here is what I’ve implemented

The structured dictionaries idea was invented by Bill Macready, who runs software & algorithms at D-Wave. When he originally implemented this idea (back in 2009), the strategy he used to solve the dictionary optimization problem had four components, which he ran serially. The four components were:

1. Use the Cayley transform;
2. find optimal scaling of the dictionary columns;
3. do a Householder reflection;
4. and optimize each column in series, fixing the values of all the others during the optimization.

Let’s take a closer look at each of these procedures.

The Cayley transform

The Cayley transform is a very interesting technique that allows optimization of the elements of a matrix (it finds local minima which isn’t great, but it’s a hard problem) while preserving orthogonality constraints between columns of the matrix — just what we need!

In response to my last post, when I asked people for ideas about how to solve this problem, Slartibartfastibast referred me to a paper that is posted here. Everything you ever wanted to know about the Cayley transform and how it applies to our case can be found there. This resource was really helpful. Thanks Slart. While my implementation was based on Bill’s earlier work, the core ideas were the same as these and helped me quite a bit.

Finding the optimal scaling of the dictionary atoms

Given an input dictionary, this operation asks for the optimal norm of each column of the dictionary — basically we multiply each column of the input dictionary by a real variable, and then optimize over all those variables. It turns out this simple operation really helps quite a bit.

The Householder transform

In the comments section of my last post, Alireza Nejati wrote:

Basically, if you take the D matrix and premultiply it by a regular square orthonormal matrix (of size N), you get a matrix that still satisfies the constraints on D (i.e. orthogonality on columns that are not connected in the quantum machine). You can check this. Let d_i and d_j be two columns of D. Let their dot product d_i^T d_j be equal to x. Then since Q is orthonormal, it preserves dot products, and so if x is 0, the dot product of Qd_i and Qd_j is going to be zero, and if their dot product is nonzero, their dot product when premultiplied by Q is going to be nonzero.

The point to all of this is that QD is a valid dictionary matrix and, further, any valid dictionary matrix can be generated in this way. So this is the parameterization you are looking for in strategy 2. So now the problem becomes minimizing ||Z-QY||^2 where Y=DW is given, and we are now simply optimizing over Q, the set of all orthonormal matrices, which have further nice parameterizations as rotations or reflections in N-dimensional space.

This is all true! Thanks Alireza! and was one of the operations Bill had in his original implementation. The operation Bill used finds the optimal Householder transformation. Interestingly what I find is that this is not very helpful in the optimization — when I do it, it rarely reduces the objective function by much. There are other parametrizations of these types of transform. One of these was suggested by Alireja in the comments section. I’m not sure that suggestion will work well, but it is probably worth trying!

Single column optimizations

If you fix all columns in the dictionary, and allow all the elements in a single column to be variables, you can optimize these by computing the nullspace basis of the column and then optimizing within that space. You can do this column-wise, and reduce the value of the objective function by doing this. These are very ‘local’ adjustments, and tend to provide a small amount of benefit.

That’s all I’ve got in this right now. Next I’m going to repeat some of my early analyses including the structured dictionaries approach. Exciting!

## 6 thoughts on “Progress on solving the constrained optimization problem for the dictionary”

1. Will this make it possible to perform certain kinds of calculations exponentially faster than a classical computer which has been optimized to perform the same kind of calculation, provided the D-Wave quantum annealer has sufficient qubits?

• Hi Mark! No-one knows, and none of the experimental results obtained yet really answer this question as the run-time scaling observed for our hardware is still dominated by mis-specification errors in setting the machine language parameters (basically you try to program in a problem and what ‘actually’ gets programmed is slightly different, and so the machine is solving a different problem than the one you wanted). We have upper bounds now on runtime scaling but really have no idea what the “true” scaling of quantum annealing is (if this even makes sense, as quantum annealing is a family of algorithms with a ton of parameters, such as how you embed a problem in hardware). The interesting thing is that the answer that’s the most consistent with what we’ve observed is that it’s *constant*, i.e. not a function of problem size. Of course it probably isn’t. But wouldn’t that be something if it were.

In order to get to the ‘true’ scaling of quantum annealing we need to make the actual values of the machine language parameters quite a bit closer to the desired values (about 2 orders of magnitude improvement I’d guess) which is pretty straightforward, and be able to anneal quite a bit faster (right now the I/O system restricts how fast signals can be changed which leads to a shortest annealing time of about 5 microseconds, which is way too fast to probe errors in quantum annealing, at least for small-ish size problems).

2. Interesting, in Slart’s links the author touches on using the tangent space for gradient descent (which is essentially the same idea I had in my reply to your previous post) but puts the main focus on the householder transformation method. Perhaps it would be useful to do a rigorous comparison of the two methods to see which one works best in this case.

3. Hi Geordie, I have a general question about the problem: Is it reasonable to assume that the rank of matrix W is at least k ? If it is, the objective function can be written as ||Z W^T (W W^T)^{-1} – D ||^2, which makes the problem a little more explicit – finding k vectors in R^N closest to the given set of vectors and subject to the constraints. With regards to the constraints, since d_i * d_j=0 for disconnected elements, is there a setup where we would be interested in solving a similar problem but with the constraints having the form D^T D = C, where C is a given matrix of connection strengths?Thank you!

• Hi Zhenya! Yes I believe that W will have rank at least K, and I think your reformulation is perfectly fine! As to your second question, in fact the way my implementation works the constraints do have this form. I first find a random matrix D_0 that obeys the \vec{d}_i * \vec{d}_j = 0 orthogonality constraints (by first making all of the columns completely random, then normalizing each column to norm 1, then iterating over all columns by first erasing the contents of each and replacing it with a new random column written in the null-space basis), then I preserve the constraint D^T D = D_0^T D_0 (ie my C = D_0^T D_0) during the optimization. For example, the Cayley transform preserves these constraints.

4. There’s no paradox to choice. The more options you have, the more possibilities you have to consider. The more constraints, the fewer you have to consider.