# Quantum Boltzmann Machines

Sometimes I think about top ten lists. Like what my top ten favorite songs of all time would be, or my top ten favorite books. You can probably tell a lot about someone by what would be on those lists. I once did a personality profiling procedure, which took the answers to about one hundred multiple choice questions and put the respondent into one of 25 bins. [Note: I just found the document, there were actually 72 bins. It was called the Insights Discovery Profile. I was in Bin #22, AKA "Reforming Director"]. The bin I was in was an eerily accurate description of me. I think of this procedure as ‘human dimensionality reduction’. It’s like PCA!

One of my top ten books is On Intelligence by Jeff Hawkins. Jeff was the founder of Palm and Handspring. If you’re not in a hardware company, there is an important fact I will share with you of which you may be unaware. Building hardware that works in the real world is a special kind of hell. I suspect Dante originally had a level of the Inferno where you had to just make carts or butter churns or whatever, but it was so painful to think about he took it out. So I feel a sort of esprit de corps with anyone who has suffered through that special kind of torture.

I first read On Intelligence in 2008, and it was my first exposure to two important ideas. They are:

1. Intelligence is related to, and can even be defined to be, an entity’s ability to build an internal representation of the world, and correctly predict the outcomes of its possible actions within that model.
2. Mammalian brains contain a structure, called the neocortex, that allows mammals to build models of the world out of the same repeating physical structure, tiled out a huge number of times. This repeating structure is hierarchical, and allows mammals to efficiently build representations of the world that are also hierarchical.

In my next series of posts I want to show you something related to both of these points. It’s a fascinating way to connect what D-Wave hardware does to the bleeding edge of machine learning.

# NASA Experts Available For Interviews About Quantum Computing | NASA

This is cool! Check out the link for a resource that can help answer questions about D-Wave quantum computers from the perspective of our users at NASA.

# Lockheed Martin Tweet Chat: #QuantumChat

Lockheed Martin is hosting an interesting event, which is linked to here. It’s an opportunity to talk to people who are working with actual D-Wave quantum computers. If you have questions, now you can have them answered by people who actually know what they are talking about. What a concept! Exciting! Here’s a brief summary, from the event page linked to above:

Join quantum computing experts from Lockheed Martin, the University of Southern California and D-Wave Systems as they “borrow” their companies’ Twitter accounts to discuss the latest in speedy qubits and the quantum evolution.

Tweet your questions to @LockheedMartin, @USCViterbi or @dwavesys with the #QuantumChat hashtag starting Nov. 7. @LockheedMartin will moderate the chat and pose questions beginning at 1 p.m. EDT on Thursday, Nov. 14. Questions will be selected from those tweeted with the #QuantumChat hashtag between now and the end of the chat.

Can’t follow along with the Tweet Chat live? Watch for the full chat transcript on our Storify page.

# D-Wave Systems and Cypress Announce Partnership

Cypress Semiconductor Corporation is a semiconductor design and manufacturing company founded in 1982 by T. J. Rodgers and others from AMD. We’re working with them to build bigger and better superconducting circuits — ultimately millions of qubits and billions of devices per chip. The biggest problem we (or any QC effort that follows us) face is the manufacturability of designs, and Cypress has one of the most incredible fabrication operations I have ever seen. You can see an overhead shot of the Minnesota facility to the right. This doesn’t do it justice. Acres of fab machines. And the people are top rate. Very exciting.

From HPCWire:

SAN JOSE, Calif. & BURNABY, British Columbia — Cypress Semiconductor Corp. and D-Wave Systems Inc., the world’s first commercial quantum computing company, today announced that D-Wave has successfully transferred its proprietary process technology for building quantum computing microprocessors to Cypress’s Wafer Foundry located in Bloomington, Minnesota. D-Wave selected Cypress as its foundry and started the site change in January of 2013, and Cypress delivered first silicon on June 26. Results from this lot indicate better yields than D-Wave has achieved in the past, validating the quality of Cypress’s production-scale environment.

# We do it because we must

Google produced a most excellent video introducing some of the folks working at the Quantum Artificial Intelligence Lab. Here it is!

You can see some cool shots of our new facility in the piece, like this one.

There are more than a dozen of these machines doing interesting things now. They are crunching away on everything from basic physics experiments, probing entanglement on scales humans have never been able to reach before, to commercial applications of machine learning — like the wink detector in the Google Glass product.

There are some great memes in the video. One of my favorites was raised by Sergio Boixo. He says at 4:25,  ‘… [this machine] teaches us that we shouldn’t be naive about the world, and we shouldn’t think about the world as a simple machine. It forces us to consider more sophisticated notions of how the reality around us is actually shaped.’

This reminded me of a great bit that Neil deGrasse Tyson did. Here he discusses what he calls ‘A fascinatingly disturbing thought’. Here’s a quote (the full version is here):

I lay awake at nights wondering whether we as a species are simply too stupid to figure out the Universe that we are investigating, and maybe we need some other species one percent smarter than we are, for which string theory would be intuitive, for which all the greatest mysteries of the Universe, from dark matter, dark energy, the origins of life, and all the frontiers of our thought, would be something that they would just self-intuit. I’m jealous of that possibility. Because I want to be around for those discoveries.

I feel a lot of sympathy for this position. It got me thinking that waiting for the “real one percenters” to show up might be dangerous. Maybe it’s a better strategy to try to create them. If only we had some type of quantum artificial intelligence…

# A visit to IDC’s 50th HPC user forum

Yesterday I was part of a session at IDC’s HPC user forum. The session was interesting because it was about quantum computing, the first time the HPC user forum has had a session on the topic. I think there will be many more in the future.

Not only was there an entire session on the topic, but the keynote speaker at the event was Charlie Bennett, an IBM Fellow who is well-known to quantum information folks, as (among other things) he co-invented quantum cryptography. I’m going to do a separate post on what he was talking about as it was fascinating.

The session I was part of was led off by Isaac Chuang from MIT, who gave an overview of where he felt the field of quantum information science and technology stood. He presented a pretty comprehensive set of experimental results obtained between c. 2002 and c. 2013, showing an impressive advance from being able to do about 10 gates on 1 qubit in 2002 to tens of gates on about 10 qubits in 2013. Unfortunately, he completely omitted any mention of our work, or the work of independent folks doing science on our machines. I will send him some copies of Nature.

I was next, and started the festivities by stating that basically I disagreed with everything Ike had said, and was going to give a very different perspective. I felt bad about being confrontational (obviously I still haven’t watched enough Hitchens, I am working on it). But he was in the room, so if he wanted to call me on something he could (he didn’t). A smart non-expert audience that hears completely conflicting stories is going to be confused and wonder what’s up. So I thought it would be in the interests of the audience to put a name on it.

Anyway, once the drama was out of the way, I gave my talk. Here are the slides.

After my talk, Hartmut Neven from Google talked about their D-Wave machine, and what they were doing with it. He described three use cases for machine learning, including finding extremely sparse classifiers, reducing the negative effects of improperly labeled items in supervised machine learning methods, and training and inference in deep learning. One very interesting thing he revealed was that the first of these was used to train blink detectors in the Google Glass product. This is the very first time that a quantum algorithm has been used to develop commercially deployed software.

This is extremely cool, and the beginning of what I see as a new use case opportunity for us. The scenario where you need an always on detector / classifier onboard a device with extreme power constraints is increasingly common. I just absolutely love the idea that the software that makes mobile devices work can be designed by quantum computers.

Next, Dave Wecker from Microsoft gave an overview of his group’s work on building software for programming, compiling and visualizing quantum circuits. The work they are doing is really top notch, and the presentation was great.

Finally, Jay Gambetta from IBM Yorktown Heights gave a talk about the IBM work on transmon qubits. I think the main point of interest for me about this work is how difficult / impossible it would be to scale any of it up. Lots of microwave lines!!!

# Link to Singularity 1-1 interview

I recently had the pleasure of doing an interview with Nikola Danaylov AKA Socrates. I really like his interview series — he’s spoken to a who’s who of interesting people. You should watch them. Here is a link to his site. I especially liked his most recent interview with Kevin Warwick.

In the interview we talk about a whole bunch of topics, including the founding of D-Wave, what D-Wave is all about, the awesomeness of wrestling, superconducting processors, quantum computers, Rose’s Law, the multiverse, stick wielding primates, the Singularity, artificial intelligence, and a bunch of other stuff. Check it out!

# Initial results using new image library

We decided to switch our training and test datasets to one we custom built. This new dataset, which is going to be made publicly available, is formed of seventy 2592 x 1944 pixel color photographs supplied by Suzanne. The training set has sixty images and the test set has ten. Here is training image #57, which is from the top of the Glacier Express on Blackcomb.

Fact: This picture makes me want to go snowboarding.

For the tests I’m going to describe, each image is segmented into 1,024 non-overlapping patches of size 20 x 15 pixels. For the training set, this means a total of 60 * 1,024 = 61,440 patches.

The sparse coding procedure is then applied to these patches, and a 512 atom dictionary is learned that allows reconstructions of all of the image patches.
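The patch-extraction step can be sketched as follows. Note this is a minimal sketch, not the actual pipeline code: the `extract_patches` helper is hypothetical, and the arithmetic implies each photograph is first downsampled to 640 x 480, since a 32 x 32 grid of 20 x 15 pixel patches tiles exactly that size to give the 1,024 patches per image quoted above.

```python
import numpy as np

def extract_patches(image, ph=15, pw=20):
    """Segment an image into non-overlapping ph x pw patches,
    flattening each RGB patch into a vector of length ph*pw*3."""
    h, w, c = image.shape
    patches = []
    for i in range(0, h - ph + 1, ph):
        for j in range(0, w - pw + 1, pw):
            patches.append(image[i:i+ph, j:j+pw, :].reshape(-1))
    return np.array(patches)

# Hypothetical example: a 480 x 640 RGB image yields 32 * 32 = 1,024
# patches, each a vector of length 20 * 15 * 3 = 900.
img = np.random.rand(480, 640, 3)
patches = extract_patches(img)
print(patches.shape)  # (1024, 900)
```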


A trial run

The first thing I did was run through the entire sparse coding procedure, using the L1-norm version with $\lambda = 0.001$ and keeping $N = 507$ SVD modes. Note that the original image patches are vectors of length 20 x 15 x 3 (RGB) = 900; keeping the first 507 of these gives pretty good reproductions. The value of $\lambda$ used is small enough to not regularize strongly, and therefore the reconstructions in this run aren’t going to be sparse.

Note that the reason I chose 512 dictionary atoms and $N = 507$ was in anticipation of comparing these results directly to the L0-norm structured dictionaries run in hardware, where for reasons I outlined in earlier posts these are sensible choices: the number of dictionary atoms should equal the number of qubits, and should be bigger than $N$ for overcompleteness, but $N$ has to be at least the number of atoms minus the connectivity of the least connected qubit (in the case of a Vesuvius chip, this is 5) in order for the structured dictionaries idea to work.
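For readers who want to experiment, here is a small stand-in for the L1 sparse coding step using scikit-learn’s mini-batch dictionary learner. This is only a sketch: our actual runs used a custom implementation on hundreds of cores, and the problem sizes here are shrunk for illustration (the real runs used 61,440 patches of dimension 900 and $K = 512$ atoms).

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Hypothetical small-scale stand-in: 1,000 "patches" of dimension 90.
rng = np.random.default_rng(0)
Z = rng.standard_normal((1000, 90))

learner = MiniBatchDictionaryLearning(
    n_components=64,          # K, the number of dictionary atoms
    alpha=0.001,              # the L1 penalty weight (lambda in the post)
    transform_algorithm="lasso_lars",
    transform_alpha=0.001,
    random_state=0,
)
W = learner.fit_transform(Z)  # sparse weights, shape (1000, 64)
D = learner.components_       # dictionary atoms, shape (64, 90)

# Average sparsity: mean number of atoms used per reconstruction.
avg_sparsity = np.mean(np.count_nonzero(W, axis=1))
print(W.shape, D.shape, avg_sparsity)
```

With a small $\lambda$ (here `alpha=0.001`) the codes will not be very sparse, mirroring the behavior described in this run.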

Here are some run results.

• The average sparsity is: 187.585 atoms used per reconstruction
• The reconstruction error on the training set is: 13.527
• The average reconstruction error per image is: 0.000220
• The wallclock time is: 143380.135409 seconds running on 200 cores
• Lowest objective function value: 87.9729119482.

The average number of atoms used per image to autoencode the training set was 187.6 out of 512, so we were right about this value of $\lambda$ being too small to make reconstructions sparse. But it’s probably big enough to start seeing some of the features we want.

The dictionary learned

Here are the dictionary atoms learned for these parameters. In this picture the atoms have no particular order.

A 512 atom dictionary, learned over 61,440 image patches from natural images.

These look pretty good. Note that there is a mix of random-looking atoms and ones with structure that look like edges. We see this type of result when the regularization puts the reconstructions in the region between no regularization at all and the sparse regime. I think what’s happening is that the random looking atoms have the freedom to ‘memorize’ the training set, reducing the reconstruction error, but not really learning anything generalizable about images. The edge-type ones are what we are really after, and we start to see some showing up here, but we need a larger value of $\lambda$ to force most or all of the atoms to be like that.

A second trial with larger $\lambda$

Next I re-ran the procedure with $\lambda = 0.01$ on 300 cores.

Here are some run results.

• The average sparsity is: 27.5803548177 atoms used per reconstruction
• The reconstruction error on the training set is: 127.292824922
• The average reconstruction error per image is: 0.0020718233223
• The wallclock time is: 15340 seconds
• Lowest objective function value: 492.77728.

The dictionary learned

This looks great. This is about where we want to be. As in prior runs, somewhere in the range of 20-50 atoms per reconstruction seems to be a good place to be for L1 for the parameters we’re using.

So we’re good, infrastructure seems to be working (mostly). Now I’m going to do a sweep through $\lambda$ and plot reconstruction error on both training and test sets as a function of average number of atoms used, for L1, L0 and structured dictionaries.

A 512 atom dictionary with lambda = 0.01 and N = 507, learned over 61,440 image patches.

# Progress on solving the constrained optimization problem for the dictionary

It’s a strange thing, but adding constraints to a problem often helps solve it. I’ve seen this happen many times. I think the reason is that if there are too many options, it’s very hard to focus on a very specific objective, and without this kind of specificity it’s very hard to make progress.

Interestingly, it seems as if restricting choices makes people happier also. You should watch this TED talk, it is fascinating.

Also you should watch this to fully appreciate the cosmic horror of freedom.


Constraints! Horrible, horrible constraints!

In my last post I introduced a problem that arises when we structure the dictionary so that the QUBOs sparse coding generates have some desired structure. In particular we want to constrain the QUBOs to have the specific connectivity of some given D-Wave processor, so that we don’t have to do any contortions to map problems into hardware. In general, this idea of building algorithms subject to them having to use the hardware in its native mode is a great idea. Endorsed by space ant colonies everywhere.

Unfortunately in this case, when we do this one of the other steps in the sparse coding procedure goes from easy to hard. Here is the new, hard problem that we need to solve.

Finding an optimal structured dictionary given fixed weights

Given

1. An $N \times S$ real valued matrix $\hat{Z}$, where $S$ is the number of data objects, and we define the $s^{th}$ column to be the $s^{th}$ data object $\vec{z}_s$, where each $\vec{z}_s$ is a real valued vector with $N$ components, and the matrix elements of $\hat{Z}$ are $z_{ns}$;
2. An $N \times K$ real valued matrix $\hat{D}$, where $K$ is the number of dictionary atoms we choose, and we define its $k^{th}$ column to be the vector $\vec{d}_k$, and the matrix elements of $\hat{D}$ are $d_{nk}$;
3. And a $K \times S$ binary valued matrix $\hat{W}$ with matrix elements $w_{ks}$;

Find $\hat{D}$ that minimizes

$G^{*}(\hat{D}) = \sum_{s=1}^S || \vec{z}_{s} - \sum_{k=1}^{K} w_{ks} \vec{d}_k ||^2$

$= \sum_{s=1}^S \sum_{n=1}^N \left( z_{ns} - \sum_{k=1}^{K} w_{ks} d_{nk} \right)^2$

$= ||\hat{Z} - \hat{D} \hat{W}||^2 = \mathrm{Tr}(\hat{A}^T \hat{A})$

where $\hat{A}=\hat{Z}-\hat{D} \hat{W}$, subject to the constraints that $\vec{d}_j \cdot \vec{d}_m = 0$ for all pairs $j,m$ that are not connected in the quantum chip being used.
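The objective and its constraint can be written down directly in code. The sketch below is illustrative only: the function name is hypothetical, and `edges` stands in for the actual connectivity graph of the chip (a real run would use the Vesuvius graph).

```python
import numpy as np

def structured_objective(Z, D, W, edges):
    """Objective G*(D) = ||Z - D W||^2 (squared Frobenius norm), plus a
    check that unconnected atom pairs are orthogonal. `edges` is the set
    of atom pairs (j, m) that ARE connected on the chip."""
    A = Z - D @ W
    value = np.trace(A.T @ A)          # equals np.sum(A**2)
    gram = D.T @ D                     # all pairwise atom dot products
    K = D.shape[1]
    for j in range(K):
        for m in range(j + 1, K):
            if (j, m) not in edges:
                assert abs(gram[j, m]) < 1e-8, "constraint violated"
    return value

# Tiny hypothetical instance: N = 4, K = 3, S = 5; atoms 0 and 1 connected.
rng = np.random.default_rng(1)
D, _ = np.linalg.qr(rng.standard_normal((4, 3)))   # orthonormal columns satisfy all constraints
W = rng.integers(0, 2, size=(3, 5)).astype(float)  # binary weights
Z = rng.standard_normal((4, 5))
print(structured_objective(Z, D, W, edges={(0, 1)}))
```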

Here is what I’ve implemented

The structured dictionaries idea was invented by Bill Macready, who runs software & algorithms at D-Wave. When he originally implemented this idea (back in 2009), the strategy he used to solve the dictionary optimization problem had four components, which he ran serially. The four components were:

1. Use the Cayley transform;
2. find optimal scaling of the dictionary columns;
3. do a Householder reflection;
4. and optimize each column in series, fixing the values of all the others during the optimization.

Let’s take a closer look at each of these procedures.

The Cayley transform

The Cayley transform is a very interesting technique that allows optimization of the elements of a matrix (it finds local minima, which isn’t great, but it’s a hard problem) while preserving orthogonality constraints between columns of the matrix — just what we need!

In response to my last post, when I asked people for ideas about how to solve this problem, Slartibartfastibast referred me to a paper that is posted here. Everything you ever wanted to know about the Cayley transform and how it applies to our case can be found there. This resource was really helpful. Thanks Slart. While my implementation was based on Bill’s earlier work, the core ideas are the same as those in the paper, and reading it helped me quite a bit.
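Here is a toy illustration of why the Cayley transform is useful here: it turns any skew-symmetric matrix into an orthogonal one, so a search parameterized this way moves the dictionary while preserving all pairwise dot products between atoms. This is just the basic identity, not the full curvilinear search described in the paper.

```python
import numpy as np

# The Cayley transform maps a skew-symmetric matrix A to an orthogonal
# matrix Q = (I - A)(I + A)^{-1}. Searching over A then moves D -> Q D
# while preserving every dot product between dictionary columns.
rng = np.random.default_rng(2)
B = rng.standard_normal((5, 5))
A = B - B.T                      # skew-symmetric: A^T = -A
I = np.eye(5)
Q = (I - A) @ np.linalg.inv(I + A)

print(np.allclose(Q.T @ Q, I))   # Q is orthogonal
```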

Finding the optimal scaling of the dictionary atoms

Given an input dictionary, this operation asks for the optimal norm of each column of the dictionary — basically we multiply each column of the input dictionary by a real variable, and then optimize over all those variables. It turns out this simple operation really helps quite a bit.
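Because the objective is linear least squares in the scale factors, this step has a closed form. Here is a sketch (the helper name is hypothetical, not our actual code). Note that rescaling columns preserves the orthogonality constraints, since zero dot products stay zero.

```python
import numpy as np

def optimal_column_scaling(Z, D, W):
    """Find scale factors alpha minimizing ||Z - D diag(alpha) W||_F^2.
    Column k contributes alpha_k * outer(d_k, w_k), so the problem is
    ordinary least squares in alpha."""
    N, K = D.shape
    M = np.stack([np.outer(D[:, k], W[k, :]).ravel() for k in range(K)], axis=1)
    alpha, *_ = np.linalg.lstsq(M, Z.ravel(), rcond=None)
    return D * alpha  # rescale each column by its optimal factor

rng = np.random.default_rng(3)
D, _ = np.linalg.qr(rng.standard_normal((6, 4)))
W = rng.integers(0, 2, size=(4, 9)).astype(float)
Z = rng.standard_normal((6, 9))
D2 = optimal_column_scaling(Z, D, W)
# The rescaled dictionary can only lower (or match) the reconstruction error,
# since alpha = all-ones recovers the original D.
print(np.sum((Z - D2 @ W) ** 2) <= np.sum((Z - D @ W) ** 2) + 1e-9)
```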

The Householder transform

In the comments section of my last post, Alireza Nejati wrote:

Basically, if you take the D matrix and premultiply it by a regular square orthonormal matrix (of size N), you get a matrix that still satisfies the constraints on D (i.e. orthogonality on columns that are not connected in the quantum machine). You can check this. Let d_i and d_j be two columns of D. Let their dot product d_i^T d_j be equal to x. Then since Q is orthonormal, it preserves dot products, and so if x is 0, the dot product of Qd_i and Qd_j is going to be zero, and if their dot product is nonzero, their dot product when premultiplied by Q is going to be nonzero.

The point to all of this is that QD is a valid dictionary matrix and, further, any valid dictionary matrix can be generated in this way. So this is the parameterization you are looking for in strategy 2. So now the problem becomes minimizing ||Z-QY||^2 where Y=DW is given, and we are now simply optimizing over Q, the set of all orthonormal matrices, which have further nice parameterizations as rotations or reflections in N-dimensional space.

This is all true (thanks Alireza!), and this was one of the operations Bill had in his original implementation. The operation Bill used finds the optimal Householder transformation. Interestingly, I find that this is not very helpful in the optimization — when I do it, it rarely reduces the objective function by much. There are other parametrizations of these types of transform. One of these was suggested by Alireza in the comments section. I’m not sure that suggestion will work well, but it is probably worth trying!
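For illustration, here is the optimum over all orthonormal premultiplications, which is the classic orthogonal Procrustes solution (Bill’s implementation finds the single best Householder reflection instead; the helper name here is hypothetical). As Alireza observed, any orthonormal $Q$ preserves the column dot products of $D$, so the constraints survive.

```python
import numpy as np

def best_orthogonal_premultiply(Z, D, W):
    """Solve min_Q ||Z - Q D W||_F^2 over orthogonal Q: the orthogonal
    Procrustes problem. With Y = D W and SVD Z Y^T = U S V^T, the
    optimum is Q = U V^T."""
    Y = D @ W
    U, _, Vt = np.linalg.svd(Z @ Y.T)
    Q = U @ Vt
    return Q @ D

rng = np.random.default_rng(4)
D, _ = np.linalg.qr(rng.standard_normal((6, 4)))
W = rng.integers(0, 2, size=(4, 9)).astype(float)
Z = rng.standard_normal((6, 9))
D2 = best_orthogonal_premultiply(Z, D, W)
print(np.allclose(D2.T @ D2, D.T @ D))  # column dot products preserved
# Q = I is feasible, so the optimum never increases the error.
print(np.sum((Z - D2 @ W) ** 2) <= np.sum((Z - D @ W) ** 2) + 1e-9)
```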

Single column optimizations

If you fix all columns in the dictionary, and allow all the elements in a single column to be variables, you can optimize these by computing the nullspace basis of the column and then optimizing within that space. You can do this column-wise, and reduce the value of the objective function by doing this. These are very ‘local’ adjustments, and tend to provide a small amount of benefit.
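A sketch of one column update (the helper name and the connectivity used in the example are hypothetical): the linear constraints confine the column to a nullspace, and because the objective is isotropic in that column, the constrained optimum is simply the projection of the unconstrained one onto the feasible subspace.

```python
import numpy as np
from scipy.linalg import null_space

def optimize_one_column(Z, D, W, k, unconnected):
    """Optimize dictionary column k with all other columns fixed.
    `unconnected` lists the columns d_k must remain orthogonal to."""
    w = W[k, :]
    if np.dot(w, w) == 0:
        return D                               # column unused, nothing to do
    R = Z - D @ W + np.outer(D[:, k], w)       # residual with column k removed
    d_star = R @ w / np.dot(w, w)              # unconstrained least-squares optimum
    if unconnected:
        B = null_space(D[:, unconnected].T)    # basis orthogonal to the fixed columns
        d_star = B @ (B.T @ d_star)            # project onto the feasible subspace
    D = D.copy()
    D[:, k] = d_star
    return D

rng = np.random.default_rng(5)
D, _ = np.linalg.qr(rng.standard_normal((6, 4)))
W = rng.integers(0, 2, size=(4, 9)).astype(float)
Z = rng.standard_normal((6, 9))
# Hypothetical connectivity: atom 0 must stay orthogonal to atoms 2 and 3.
D2 = optimize_one_column(Z, D, W, k=0, unconnected=[2, 3])
# The old column is feasible, so the update never increases the error.
print(np.sum((Z - D2 @ W) ** 2) <= np.sum((Z - D @ W) ** 2) + 1e-9)
```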

That’s all I’ve got on this right now. Next I’m going to repeat some of my early analyses including the structured dictionaries approach. Exciting!