Sparse coding on D-Wave hardware: finding an optimal structured dictionary

I spend most of my time thinking about machine intelligence. I would like to build machines that can think and act like we do. There are many hard problems to solve before we get there.

(Image caption: C. elegans is unsegmented, vermiform, and bilaterally symmetrical, with a cuticle integument, four main epidermal cords and a fluid-filled pseudocoelomate cavity. There are two sexes: male and hermaphrodite... pfn'glui!!!!)

A thing I’ve been thinking about recently is how the limits of cognition matter in understanding cognition itself. Clearly you can have examples where the machinery is not sophisticated enough. For example, a C. elegans roundworm is not going to reverse engineer its own neural system any time soon, even though that system is very simple (although the descendants of OpenWorm might…).

OpenWorm is a really interesting idea, and I hope it succeeds. Although the idea of the first life forms on the internet being worms, which OBVIOUSLY will grow super intelligent, take over the universe, and consume all its atoms making bigger and bigger Harlem Shake videos, is a little off-putting from a human perspective. De Vermis Mysteriis.

(Image caption: We are so smart, s-m-r-t. Hoomans are smrt.)

Imagine the most intelligent entity possible. Would that thing be able to understand its own cognition? As you crank up a cognitive system’s ability to model its environment, presumably the cognitive system itself gets more difficult to understand.

Is the human cognitive system both smart enough and simple enough to reverse engineer itself? It’s probably in the right zone: we seem to be smart enough to understand enough of the issues to take a good run at the problem, while our cognitive system seems simple enough not to be beyond its own ability to understand. How’s that for some Hofstadter-style recursion?

Anyway, enough with the Deep Thoughts. Let’s do some math! Math is fun. Not as fun as universe-eating worms. But solving this problem well is important. At least to me and my unborn future vermiform army. Maybe you can help solve it.

A short review of L0-norm sparse coding with structured dictionaries

Last time we discussed sparse coding on the hardware, I introduced an idea for getting around a problem in using D-Wave style processor architectures effectively – the mismatch between the connectivity of the problem we want to solve and the connectivity of the hardware.

Let’s begin by reviewing the idea. If you’d like a more in-depth overview, here is the original post I wrote about it. Here is the condensed version.

Given

  1. A set of S data objects \vec{z}_s, where each \vec{z}_s is a real valued vector with N components;
  2. An N x K real valued matrix \hat{D}, where K is the number of dictionary atoms we choose, and we define its k^{th} column to be the vector \vec{d}_k;
  3. A K x S binary valued matrix \hat{W}, whose matrix elements are w_{ks};
  4. And a real number \lambda, which is called the regularization parameter,

Find \hat{W} and \hat{D} that minimize

G(\hat{W}, \hat{D} ; \lambda) = \sum_{s=1}^S || \vec{z}_{s} - \sum_{k=1}^{K} w_{ks} \vec{d}_k ||^2 + \lambda \sum_{s=1}^S \sum_{k=1}^{K} w_{ks}

subject to the constraints that \vec{d}_j \cdot \vec{d}_m = 0 for all pairs j,m that are not connected in the quantum chip being used.
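
To make the objective concrete, here is a minimal NumPy sketch that evaluates G for a given Z, D, W and λ. The function name is hypothetical; the array shapes follow the definitions above.

```python
import numpy as np

def sparse_coding_objective(Z, D, W, lam):
    """Evaluate G(W, D; lambda) = sum_s ||z_s - sum_k w_ks d_k||^2 + lambda * sum_ks w_ks.

    Z   : (N, S) real data matrix, columns are the data objects z_s
    D   : (N, K) real dictionary, columns are the atoms d_k
    W   : (K, S) binary weight matrix with entries w_ks
    lam : regularization parameter lambda
    """
    residual = Z - D @ W                    # reconstruction error for all s at once
    reconstruction = np.sum(residual ** 2)  # sum_s ||z_s - sum_k w_ks d_k||^2
    sparsity_penalty = lam * np.sum(W)      # lambda * sum_s sum_k w_ks (W is 0/1)
    return reconstruction + sparsity_penalty
```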

To solve this problem, we use block coordinate descent, which works like this:

  1. First, we generate a random dictionary \hat{D}, subject to meeting the orthogonality constraints we’ve imposed on the dictionary atoms.
  2. Assuming this fixed dictionary, we solve the optimization problem for the weights \hat{W}. These optimization problems are Chimera-structured QUBOs that fit exactly onto the hardware by construction.
  3. Now we fix the weights to these values, and find the optimal dictionary \hat{D}, again subject to our constraints.

We then iterate steps 2 and 3 until G converges to a minimum, keeping in mind that this problem is jointly non-convex, so the minimum you get will be a local minimum. Each restart of the whole algorithm from a new starting point will generally lead to a different local minimum, so a better answer can be had by running the procedure several times.
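
For orientation, here is a minimal sketch of that block coordinate descent loop. The three callables (`init_dictionary`, `solve_weights`, `solve_dictionary`) are hypothetical placeholders for steps 1 to 3, not actual D-Wave API calls, and in practice you would also track G and restart from several random dictionaries.

```python
import numpy as np

def block_coordinate_descent(Z, K, lam, init_dictionary, solve_weights, solve_dictionary,
                             n_iters=50):
    """Schematic outer loop for L0-norm sparse coding with a structured dictionary.

    Placeholder callables for the three steps described above:
      init_dictionary(N, K)     -> D  (step 1: random D obeying the orthogonality constraints)
      solve_weights(Z, D, lam)  -> W  (step 2: Chimera-structured QUBOs, one per data object)
      solve_dictionary(Z, W, D) -> D  (step 3: constrained dictionary update, the hard part)
    """
    N, _ = Z.shape
    D = init_dictionary(N, K)            # step 1
    W = None
    for _ in range(n_iters):             # in practice, stop when G stops decreasing
        W = solve_weights(Z, D, lam)     # step 2: optimize weights with D fixed
        D = solve_dictionary(Z, W, D)    # step 3: optimize dictionary with W fixed
    return D, W
```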

Step 3: Finding an optimal structured dictionary given fixed weights

The hard problem is Step 3 above. Here the weights \hat{W} are fixed, and we want to find an optimal structured dictionary. Here is the formal statement of the problem.

Given

  1. An N x S real valued matrix \hat{Z}, where S is the number of data objects, and we define the s^{th} column to be the s^{th} data object \vec{z}_s, where each \vec{z}_s is a real valued vector with N components, and the matrix elements of \hat{Z} are z_{ns};
  2. An N x K real valued matrix \hat{D}, where K is the number of dictionary atoms we choose, and we define its k^{th} column to be the vector \vec{d}_k, and the matrix elements of \hat{D} are d_{nk};
  3. And a K x S binary valued matrix \hat{W} with matrix elements w_{ks};

Find \hat{D} that minimizes

G^{*}(\hat{D}) = \sum_{s=1}^S || \vec{z}_{s} - \sum_{k=1}^{K} w_{ks} \vec{d}_k ||^2 = \sum_{s=1}^S \sum_{n=1}^N (z_{ns} - \sum_{k=1}^{K} w_{ks} d_{nk})^2 = ||\hat{Z} - \hat{D} \hat{W}||^2 = Tr(\hat{A}^T \hat{A})

where \hat{A}=\hat{Z}-\hat{D} \hat{W}, subject to the constraints that \vec{d}_j \cdot \vec{d}_m = 0 for all pairs j,m that are not connected in the quantum chip being used.

What makes this problem hard is that the constraints on the dictionary atoms are non-linear, and there are a lot of them (one for each pair of variables not connected in hardware).
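
To make the constraint set concrete, here is a minimal NumPy sketch that counts how many of the required orthogonality constraints a candidate dictionary violates, given the set of qubit pairs that are connected in hardware. The function name and `connected_pairs` are hypothetical stand-ins, not an actual hardware-graph API.

```python
import numpy as np

def constraint_violations(D, connected_pairs, tol=1e-9):
    """List violated constraints d_j . d_m = 0 for pairs (j, m) NOT connected in hardware.

    D               : (N, K) dictionary, columns are the atoms
    connected_pairs : set of frozensets {j, m} that ARE connected on the chip
    """
    K = D.shape[1]
    gram = D.T @ D                      # gram[j, m] = d_j . d_m
    violations = []
    for j in range(K):
        for m in range(j + 1, K):
            if frozenset((j, m)) not in connected_pairs and abs(gram[j, m]) > tol:
                violations.append((j, m, gram[j, m]))
    return violations
```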

Ideas for attacking this problem

I’m not sure what the best approach is for trying to solve this problem. Here are some observations:

  • We want to be operating in the regime where \hat{W} is sparse, so most of the w_{ks} will be zero. Because the coupling terms are quadratic in \hat{W}‘s matrix elements, for L0-norm sparse coding problems most of the coupling terms will also be zero. This suggests a possible strategy: first solve for \hat{D} assuming the quadratic coupling terms are zero, and then use that solution as an initial starting point for whatever we do next (a minimal sketch of this decoupled initialization appears after this list).
  • There are some types of matrix operations that would not mess up the structure of the dictionary but would allow parametrization of changes within the allowed space. If we could then optimize over those parameters we could take care of the constraints without having to do any work to enforce them.
  • There is a local search heuristic where you optimize each dictionary atom \vec{d}_k in turn, moving from k=1 to k=K while keeping the other columns fixed, and iterate until convergence (some rearranging is needed to maintain orthogonality throughout, using the null space idea from the previous post). By itself this probably won’t be a great strategy and will likely get stuck in local optima, but it might work OK.
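
To make the first bullet concrete: if the cross terms w_{ks} w_{js} with j ≠ k are dropped, the atoms decouple, and since the weights are binary (w_{ks}^2 = w_{ks}) the unconstrained least-squares solution for each atom is just the average of the data objects that use it. Below is a minimal sketch of that initialization; the function name is hypothetical, and the result would still need a subsequent projection or repair step to satisfy the orthogonality constraints.

```python
import numpy as np

def decoupled_dictionary_init(Z, W, eps=1e-12):
    """Initialize D assuming the cross terms w_ks * w_js (j != k) are zero.

    In that limit each atom decouples and the least-squares solution is the
    mean of the data objects that use the atom:
        d_k = (sum_s w_ks z_s) / (sum_s w_ks)
    Atoms used by no data object are left as zero columns.

    Z : (N, S) data matrix, W : (K, S) binary weights.  Returns D : (N, K).
    """
    usage = W.sum(axis=1)                  # sum_s w_ks: how many objects use atom k
    D = Z @ W.T / np.maximum(usage, eps)   # column k = (sum_s w_ks z_s) / usage_k
    D[:, usage == 0] = 0.0                 # unused atoms get zero columns
    return D
```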

What do you think? I might be able to get you time on a machine if you can come up with an interesting way to solve this problem effectively… :-)

16 thoughts on “Sparse coding on D-Wave hardware: finding an optimal structured dictionary”

  1. Not sure if you’re still interested in this problem, but there’s a way to parameterize the matrices as per your second strategy.

    Basically, if you take the D matrix and premultiply it by an N x N orthogonal matrix Q, you get a matrix that still satisfies the constraints on D (i.e. orthogonality between columns that are not connected in the quantum machine). You can check this. Let d_i and d_j be two columns of D, and let their dot product d_i^T d_j be equal to x. Since Q is orthogonal, it preserves dot products, so if x is 0, the dot product of Qd_i and Qd_j is zero, and if x is nonzero, the dot product of Qd_i and Qd_j is nonzero.

    The point of all this is that QD is a valid dictionary matrix and, further, any valid dictionary matrix can be generated in this way. So this is the parameterization you are looking for in strategy 2. The problem now becomes minimizing ||Z-QY||^2 where Y=DW is given, and we are simply optimizing over Q, the set of all orthogonal matrices, which has further nice parameterizations in terms of rotations and reflections in N-dimensional space.
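
As a quick numerical check of this claim: premultiplying by an orthogonal Q leaves the Gram matrix D^T D unchanged, so any dot products that were zero stay zero and the constraint pattern is preserved. Here is a minimal NumPy sketch; the function name is hypothetical.

```python
import numpy as np

def check_rotation_preserves_pattern(D, tol=1e-9, seed=0):
    """Verify that Q @ D has the same column dot-product (Gram) matrix as D
    for a random orthogonal Q, so the zero/nonzero constraint pattern is preserved."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    Q, _ = np.linalg.qr(rng.standard_normal((N, N)))   # random N x N orthogonal matrix
    gram_before = D.T @ D
    gram_after = (Q @ D).T @ (Q @ D)                   # equals D^T Q^T Q D = D^T D
    return np.allclose(gram_before, gram_after, atol=tol)
```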

      • Optimizing over orthogonal matrices is widely used in ICA methods, so good methods have been developed for it. One way is to restrict ourselves to rotation matrices (SO(n)) and perform gradient descent over the angle(s) of rotation. In other words, try to find the rotation matrix that rotates the columns of Y to get as close as possible to Z. The naive way to do this (multiplying together Givens rotations) isn’t very computationally efficient; better ways involve working with the Lie algebra so(n). This is addressed in the article “Lie group methods for optimization with orthogonality constraints” by Mark D. Plumbley:

        http://link.springer.com/content/pdf/10.1007%2F978-3-540-30110-3_157.pdf

        There are two reasons this would be much better than Householder reflections:
        1. We can assume the W optimization step has been carried out well, so the columns of Y are already going to be close to Z. Thus a ‘fine-tuning’ step is better suited to optimization than reflections, which would make the objective function worse most of the time.
        2. Rotations are more efficient for sparse matrices, which D is going to wind up being after a few iterations. Reflections are more efficient for dense matrices.
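
For illustration, here is a minimal sketch of one retraction-based gradient step on SO(n) for f(Q) = ||Z - QY||^2, in the spirit of the Lie-algebra approach mentioned above. It is a generic manifold gradient step, not the specific method from Plumbley's paper, and the function name is hypothetical: the Euclidean gradient is -2(Z - QY)Y^T, its skew-symmetric projection lives in the Lie algebra so(n), and the matrix exponential maps the step back onto the manifold, so Q stays orthogonal.

```python
import numpy as np
from scipy.linalg import expm

def so_n_gradient_step(Q, Z, Y, step=1e-2):
    """One geodesic gradient-descent step on SO(n) for f(Q) = ||Z - Q Y||_F^2.

    Q : (N, N) current rotation, Z : (N, S) data matrix, Y = D W : (N, S).
    """
    G = -2.0 * (Z - Q @ Y) @ Y.T        # Euclidean gradient of f at Q
    A = Q.T @ G
    A_skew = 0.5 * (A - A.T)            # projection onto the skew-symmetric matrices so(n)
    return Q @ expm(-step * A_skew)     # retraction along the geodesic; result stays in SO(n)
```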

      • Hi Alireza! Thanks very much for this, much appreciated! I think though that the two reasons you give might not be correct for the following reasons. In the first one, since we are doing block coordinate descent, even if we perfectly optimize over W giving a fixed D, those perfect W values are not necessarily close to the globally optimal values of W and D jointly. We hope that as the iterations proceed we might get close but that’s not guaranteed — we only get a local minimum in the joint W, D space because the joint problem is non-convex. In the second one, D never gets sparse — it is always fully populated.

      • As for 2, I might have been projecting my image-processing background bias onto the question. You are right that the D matrix need not be sparse in general. I still think 1 is correct, though it’s just a hunch and would need some experimentation to confirm. Speaking of which, are the data sets you are running the problem on (as well as the results from your algorithm) publicly available?

  2. I have read your interview in Nature, where you stated that programming the machine is extremely difficult.

    I have a few suggestions.

    Haskell is a programming language with some characteristics that make it very good at abstracting complexity.

    Get the whitepaper from https://www.fpcomplete.com/business/resources/white-papers. FP Complete is a newly created software maker founded by a former Microsoft executive who was head of Visual C++. They are creating an IDE that simplifies Haskell programming and that will eventually be usable offline (right now it is in beta and functions as a SaaS). Read and watch the videos that you find on https://www.fpcomplete.com/. You will also find tutorials on that website.

    Then, read “Learn You a Haskell for Great Good” by Miran Lipovaca, which you can find online at http://learnyouahaskell.com/chapters or in paperback published by No Starch Press.

    Continue by reading “Pearls of Functional Algorithm Design” by Richard Bird, written exactly for Haskell and published by Cambridge University Press. Don’t buy it for Kindle or other electronic formats, as they have rendering problems.

    Next, read “Real World Haskell” that you can find online at http://book.realworldhaskell.org/read/ or in paperback published by O’Reilly Media.

    Read the language specification at http://www.haskell.org/definition/haskell2010.pdf and the issues of Monad.Reader at http://themonadreader.wordpress.com/ and http://www.haskell.org/haskellwiki/The_Monad.Reader/Previous_issues.

    In August, read “Parallel and Concurrent Programming in Haskell”, written by Simon Marlow, which will be published by O’Reilly Media.

    I believe that Haskell has the ability to create a logical framework for abstracting D-Wave functions.

    However, Homotopy Type Theory goes much further. Here are the options for reading its book: http://homotopytypetheory.org/book/.

    The project is continually being developed on https://github.com/HoTT.

    Sony faced a similar problem in abstracting the complex functions of what could be defined as the general precursor to the present and future GPUs, the Cell parallel processor used in the PlayStation 3.

    The man who initially understood Cell is Mark Cerny, who is the lead system architect of the PlayStation 4.

    Here are two very revealing videos about him:

    http://www.youtube.com/watch?v=xHXrBnipHyA and http://www.youtube.com/watch?v=C2Cc6Q3cJkc.

    He is a person who independently wrote a 3D projection formula when he was 12 (in 1976, http://youtu.be/C2Cc6Q3cJkc?t=41m9s), was auditing classes at UC Berkeley when he was 13, and was already into the third year knowledge-wise by the time he entered UC Berkeley to study for a BS in Mathematics and Physics when he was 16 (http://youtu.be/C2Cc6Q3cJkc?t=35m36s and http://youtu.be/xHXrBnipHyA?t=3m30s). He left as he felt the degree didn’t interest him and joined Atari at the age of 17 in 1982.

    He has a physics background and states that it helps in the development of video games (http://youtu.be/C2Cc6Q3cJkc?t=40m48s).

    He is in many ways extremely intelligent, and was key to creating the abstract framework to develop software on the Cell, part of which is the SPU Runtime System.

    He was key to customizing AMD’s original GCN architecture to make the APUs as optimized as possible for general computation tasks during the PlayStation 4 development.

    Here are three interviews with him on the PlayStation 4:

    http://www.gamasutra.com/view/feature/191007/, http://www.neogaf.com/forum/showthread.php?t=532077, and http://techon.nikkeibp.co.jp/english/NEWS_EN/20130401/274313/.

    Sony in 2012 bought a cloud gaming company called Gaikai. Its purpose is to build a latency-sensitive datacenter network that is able to stream video games.

    Mark Cerny will most probably develop the combination of Gaikai and PlayStation 5 as he wants to lead Sony Computer Entertainment’s technology for at least one more decade (probably until retirement), as he implied in the first video (http://youtu.be/xHXrBnipHyA?t=46m25s).

    If D-Wave’s architecture allows it, either now or in the future, to be latency-sensitive (meaning that its computation takes less than 100 ms), and if D-Wave’s compute capabilities can be used for different parts of a video game’s runtime, you could form a partnership with Sony Computer Entertainment to create an abstract development framework and sell them future D-Wave machines for use in their cloud gaming datacenters. This would be somewhat like what AMD has done in developing its APUs with Sony Computer Entertainment, which in AMD’s case is both a hardware and a software partnership.

    Sony would obviously be very interested, as it would give them a possible significant technological edge for their cloud gaming effort, and the PlayStation 4 is set to outsell Microsoft’s Xbox One by at least 3:1 for many different reasons.

    Mark Cerny and his worldwide team would lead the effort, and it would be very worthwhile. Provided that D-Wave’s architecture can at least eventually be latency-sensitive, and that you have the will to pursue such a partnership (which, if D-Wave’s architecture allows for it, is an undoubtedly creative one), it would have a very high return for you.

    • Hi Steve! Yes I read the comments and try to respond to most of them. As to the next generation chip, historically we have been doubling qubit count every year (for nine years now!) and will likely stay on that curve. However we want to make the right choices about what to build next, and there are other dimensions in which the processors can be made better, such as increasing the connectivity of the qubits, the treewidth of the hardware graph, and the precision to which the machine language parameters can be set. So it’s possible the next gen could be significantly different in more than just qubit count. But we’ll try to keep on the curve, or ahead of it!

      As to the relative speedup over classical approaches, it’s a complicated business. When you have a totally new thing it’s unreasonable to expect everything to be clear right out of the gate. You can think of this approach as being complementary to other approaches — it is just doing something totally different from any other optimization algorithm you can run classically.

      Re. the technological singularity — I’m not a big fan of this idea. I don’t think there is a point in time where something special happens. However I do think we are at the stage in development of machine intelligence that being able to do a lot of experiments fast is important for progress, and experiments involving intelligent machines invariably require solving a huge number of optimization problems really, really fast. If D-Wave hardware can help machine intelligence researchers do more experiments, faster, then yes it will play a role.

  3. Hey Geordie -

    Your last post [Sparse coding on D-Wave hardware: finding an optimal structured dictionary] was freaking awesome. I have gone ahead and added your stuff to my Feedly account. Please keep me updated if you post anywhere else.

    Keep rocking –

  4. To add to my previous post, there is a way you can decompose finding D into a discrete optimization part (which is hard) and a continuous optimization part (which is easier in comparison). The key thing to keep in mind is that if the W optimization step is carried out correctly, we don’t need to worry about the discrete part and can restrict ourselves to the continuous part. This lets us use the full force of the quantum computer for the hard bit while solving the easy bit on a classical computer.

    • Hi Alireza! Can you elaborate on this? All the entries in D are reals so the optimization over D (given fixed W) is continuous (and convex in the case of no constraints, and non-convex in the case with the orthogonality constraints). We’re using the QC to solve for W given fixed D (W are bits and the problems being solved are QUBOs), and conventional tactics to solve for D given W.

      • By the ‘continuous part’ I meant an optimization over SO(n), and by the ‘discrete part’ I meant including reflections (as +1/-1 reflections about a coordinate axis) to cover the full O(n) space. However, this might not be required.

  5. Or, if that turns out not to be acceptable, a basin-hopping method could be used. However, I’m pretty sure that it would not be required.
