# The recent “How Quantum is the D-Wave Machine?” Shin et al. paper

Generally I try to avoid commenting on ongoing scientific debates. My view is that good explanations survive scrutiny, and bad explanations do not, and that our role in bringing quantum computers kicking and screaming into the world is to, well, build quantum computers. If people love what we build, and we do everything we can to adjust our approach to make better and better gear for the people who love our computers, we have succeeded and I sleep well.

I am going to make an exception here. Many people have asked me specifically about the recent Shin et al. paper, and I’d like to give you my perspective on it.

In my world view, science is fundamentally about good explanations.

What does this mean? David Deutsch eloquently describes this point of view in this TED talk.  There is a transcript here. He proposes and defends the idea that progress comes from discovering good explanations for why things are the way they are. You would not be reading this right now if we had not come up with good explanations for what electrons are.

We can and should directly apply these ideas to the question of whether D-Wave processors are quantum or classical. From my perspective, if the correct explanation were ‘it’s classical’, that would be critical to know as quickly as possible, because we could then identify why this was so, and attempt to fix whatever was going wrong. That’s kind of my job. So I need to really understand this sort of thing.

Here are two competing explanations for experiments performed on D-Wave processors.

Explanation #1. D-Wave processors are inherently quantum mechanical, and described by open quantum systems models where the energy scale of the noise is much less than the energy scale of the central quantum system.

Explanation #2. D-Wave processors are inherently classical, and can be described by a classical model with no need to invoke quantum mechanics.

The Shin et al. paper claims that Explanation #2 is a correct explanation of D-Wave processors. Let’s examine that claim.

Finding good explanations for experimental results

It is common practice that whenever an experiment is reported demonstrating quantum mechanical (or in general non-classical) effects, researchers look for classical models that can provide the same results. A successful theory, however, needs to explain all existing experimental results and not just a few select ones. For example, the classical model of light with the assumption of ether could successfully explain many experiments at the beginning of the 20th century. Only a few unexplained experiments were enough to lead to the emergence of special relativity.

In the case of finding good explanations for the experimental results available for D-Wave hardware, there is a treasure trove of experimental data available. Here is just a small sample. There are experimental results available on single qubits (Macroscopic Resonant Tunneling and Landau-Zener), two qubits (cotunneling) and multiple qubits, now up to about 500 (the eight qubit Nature paper, entanglement results at 16 qubits, the Boixo et al. paper).

Let’s see what we get when we apply our two competing explanations of what’s going on inside D-Wave processors to all of this data.

If we assume Explanation #1, we find that a single simple quantum model perfectly describes every single experiment ever done. In the case of the simpler data sets, experimental results agree with quantum mechanics with no free parameters, as it is possible to characterize every single term in the system’s Hamiltonian, including the noise terms.

Explanation #2 however completely fails on every single experiment listed above, except for the Boixo et al. data (I’ll give you an explanation of why this is shortly). In particular, the eight qubit quantum entanglement measured in Lanting et al. can never be explained by such a model, which rules it out as an explanation of the underlying behavior of the device. Note that this is a stronger result than the model simply being a bad explanation: the model proposed in Shin et al. makes a prediction about an experiment that you can easily perform on D-Wave processors, and that prediction contradicts what is observed.

Why the model proposed works in describing the Boixo et.al. data

Because the Shin et al. model makes predictions that contradict the experimental data for most of the experiments that have been performed on D-Wave chips, it is clearly not a correct explanation of what’s going on inside the processors. So what’s the explanation for the agreement in the case of the Boixo paper? Here’s a possibility, which we can test.

The experiment performed in the Boixo et al. paper considered a specific use of the processors. This use involved solving a specifically chosen type of problem. It turns out that for this type of problem, multi-qubit quantum dynamics and therefore entanglement are not necessary for the hardware to reach good solutions. In other words, for this experiment, a Bad Explanation (a classical model) can be concocted that matches the results of a fully quantum system.

To be more specific, the Shin et al. model replaces terms like $J_{ij} \sigma^z_i \sigma^z_j$ with $J_{ij} \langle\sigma^z_i\rangle \langle\sigma^z_j\rangle$, where $\sigma^z_i$ is a Pauli matrix and $\langle\sigma^z_i\rangle$ is the quantum average (expectation value) of $\sigma^z_i$. Since all quantum correlations are gone after such averaging, you can model $\langle\sigma^z_i\rangle$ as a classical magnetic moment in a 2D plane. But now it is clear that any experiments relying on multi-qubit quantum correlation and entanglement cannot be explained by this simple model.
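To make the effect of that averaging concrete, here is a minimal sketch (my own illustration using NumPy, not code from either paper) comparing the true two-qubit correlation with the product of single-qubit averages. The choice of a Bell state is an assumption made purely because it is the simplest entangled example.

```python
import numpy as np

# Pauli-Z and the identity for a single qubit
sz = np.array([[1.0, 0.0], [0.0, -1.0]])
eye = np.eye(2)

# Two-qubit operators sigma^z_1, sigma^z_2 and their product
sz1 = np.kron(sz, eye)
sz2 = np.kron(eye, sz)
sz1sz2 = sz1 @ sz2

def expect(op, psi):
    """Expectation value <psi|op|psi> for a normalized state vector."""
    return float(np.real(np.vdot(psi, op @ psi)))

# Entangled Bell state (|00> + |11>) / sqrt(2), chosen only as an example
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

exact = expect(sz1sz2, bell)                         # <sz_1 sz_2>
mean_field = expect(sz1, bell) * expect(sz2, bell)   # <sz_1><sz_2>

print(exact, mean_field)  # approximately 1 and 0
```

For this state the two spins are perfectly correlated, yet each single-spin average vanishes, so the product of averages carries none of the correlation.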

I’ve proposed an explanation for the agreement between the Shin et al. model and this particular experiment: the hardware is fundamentally quantum, but for the particular problem type run, this won’t show up because the problem type is ‘easy’ (in the sense that good solutions can be found without requiring multi-qubit dynamics, and an incorrect classical model can be proposed that nevertheless agrees with the experimental data).

How do we test this explanation? We change the problem type to one where a fundamental difference in experimental outcome between the processor hardware and any classical model is expected. If the Shin et al. model continues to describe what is observed in that situation, then we have a meaningful result that disagrees with the ‘hardware is quantum’ explanation. If the model disagrees with experiment, that supports two explanations: that the hardware is quantum, and that the type of problem originally studied is expected to show the same experimental results for quantum and classical models, which makes it a bad choice if distinguishing the two is your objective.

So a very important test to help determine what is truly going on is to make this change, measure the results and see what’s up. I believe that some of the folks working on our systems are doing this now. Looking forward to seeing the results!

The best explanation we have now is that D-Wave processors are beautifully quantum mechanical

The explanation that D-Wave processors are fundamentally quantum mechanical beautifully explains every single experiment that has ever been performed on them. The degree of agreement is astonishing. The results on the smallest systems, such as the individual qubits, are like nothing I’ve ever seen in terms of agreement of theory and experiment. Some day these will be in textbooks as examples of open quantum systems.

No classical model has ever been proposed that simultaneously explains all of the experiments listed above.

The specific model proposed in Shin et al. focuses only on one experiment for which there was no expectation of an experimental difference between quantum and classical models and completely (and from my perspective disingenuously) ignores the entire remainder of the mountains of experimental data on the device.

For these reasons, the Shin et al. results have no validity and no importance.

As an aside, I was disappointed when I saw what they were proposing. I had heard through the grapevine that Umesh Vazirani was preparing some really cool classical model that described the data referred to above and I was actually pretty excited to see it.

When I saw how trivially wrong it was it was like opening a Christmas present and getting socks.

# Interesting new papers from MIT and Harvard folks

There was a really interesting paper posted on the arxiv yesterday, coauthored by Peter Shor and Eddie Farhi. It analyzes ways you can adjust adiabatic quantum optimization algorithms to make them run better. There are some very good ideas here — check it out!

Also on the arxiv recently was this cool paper by Andrew Lucas at Harvard, mapping a lot of NP problems into Ising model problems.

# Training DBMs with physical neural nets

There are a lot of physical neural nets on planet Earth. Just the humans alone account for about 7.139 billion of them. You have one, hidden in close to perfect darkness inside your skull — a complex graph with about 100 billion neurons and 0.15 quadrillion connections between those neurons.

Of course we’d like to be able to build machines that do what that lump of squishy pink-gray goo does. Mostly because it’s really hard and therefore fun. But also because having an army of sentient robots would be super sweet. And it seems sad that all matter can’t be made aware of its own mortality and suffer the resultant existential angst. Stupid 5 billion year old rocks. See how smug you are when you learn about the heat death of the universe.

Biological inspiration

One thing that is also hard, but not that hard, is trying to build different kinds of physical neural nets that are somewhat inspired by our brains. ‘Somewhat inspired’ is a little vague. We don’t actually understand a lot about how brains actually work. But we know a bit. In some cases, such as our visual perception system, we know quite a bit. This knowledge has really helped the algorithmic side of building better and better learning systems.

So let’s explore engineering our own non-biological but biologically inspired physical neural nets. Does this idea make sense? How would we use such things?

Training a Deep Boltzmann Machine

One kind of neural net that’s quite interesting is a Deep Boltzmann Machine (DBM). Recall that a DBM can be thought of as a graph comprising both visible and hidden units. The visible units act as an interface layer between the external universe that the DBM is learning from, and the hidden units which are used to build an internal representation of the DBM’s universe.

A method for training a DBM was demonstrated in this paper. As we discussed earlier, the core mathematical problem for training a DBM is sampling from two different distributions — one where the visible units are clamped to data (the Creature is ‘looking at the world’), and one where the entire network is allowed to run freely (the Creature is ‘dreaming about the world’). In the general case, this is hard to do because the distributions we need to sample from are Boltzmann distributions over all the unclamped nodes of the network. In practice, the connectivity of the graph is restricted and approximate techniques are used to perform the sampling. These ideas allow very large networks to be trained, but this comes with a potentially serious loss of modeling efficiency.
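As a rough illustration of those two sampling problems, here is a minimal sketch that computes the clamped ('looking') and free-running ('dreaming') pair statistics exactly by brute-force enumeration on a four-unit network. The energy convention, the function names and the toy data are my own assumptions, and enumeration only works for tiny networks, which is exactly why sampling is the bottleneck in practice.

```python
import itertools
import numpy as np

def energy(s, b, W):
    """E(s) = -b.s - sum_{i<j} W_ij s_i s_j, for symmetric W with zero diagonal."""
    return -(b @ s) - 0.5 * (s @ W @ s)

def pair_stats(b, W, clamp=None):
    """Exact <s_i s_j> under the Boltzmann distribution.
    clamp maps a unit index to a fixed value; all other units run freely."""
    n = len(b)
    clamp = clamp or {}
    free = [i for i in range(n) if i not in clamp]
    total = np.zeros((n, n))
    z = 0.0
    for bits in itertools.product([0.0, 1.0], repeat=len(free)):
        s = np.zeros(n)
        for i, v in clamp.items():
            s[i] = v
        for i, v in zip(free, bits):
            s[i] = v
        w = np.exp(-energy(s, b, W))
        total += w * np.outer(s, s)
        z += w
    return total / z

def weight_gradient(b, W, data, visible):
    """Log-likelihood gradient: clamped statistics minus free-running statistics."""
    clamped = np.mean([pair_stats(b, W, dict(zip(visible, v))) for v in data], axis=0)
    free_running = pair_stats(b, W)
    return clamped - free_running

# Toy setup (hypothetical): units 0,1 are visible, units 2,3 are hidden
b = np.zeros(4)
W = np.zeros((4, 4))
data = [(1.0, 1.0), (0.0, 0.0)]   # the two visible units always agree
grad = weight_gradient(b, W, data, visible=[0, 1])
print(grad[0, 1])  # 0.25
```

The positive entry for the pair of visible units says the model should strengthen their coupling, since the data shows them agreeing more often than the untrained model predicts.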

Using physical hardware to perform the sampling steps

Because the sampling steps are a key bottleneck for training DBMs, maybe we could think of a better way to do it. What if we built an actual physical neural net? Could we design something that could do this task better than the software approaches typically used?

Here are the necessary ingredients:

1. A two-state device that would play the part of the neurons
2. The ability to locally programmatically bias each neuron to preferentially be in either of its states
3. Communications channels between pairs of neurons, where the relative preference of the pair could be set programmatically
4. The ability of the system to reach thermal equilibrium with its environment at a temperature with energy scale comparable to the energy scales of the individual neurons
5. The ability to read out each neuron’s state with high fidelity

If you had these ingredients, you could place the neurons where you wanted them for your network; connect them like you want for your network; program in their local biases and connection weights; allow them to reach thermal equilibrium (i.e. reach a Boltzmann distribution); and then sample by measuring their states.

The key issue here is step 4. The real question, which is difficult to answer without actually building whatever you have in mind, is whether the distribution you get in hardware is effective for learning. It might not be Boltzmann, because in the general case thermal equilibration takes exponential time. However, the devil is in the details here. The distribution sampled from when alternating Gibbs sampling is done is also not Boltzmann, but it works pretty well. A physical system might be equilibrated well enough by being smart about helping it equilibrate, using sparsely connected graphs, principles like thermal and / or quantum annealing, or other condensed matter physics / statistical mechanics inspired tricks.

The D-Wave architecture satisfies all five of these requirements. You can read about it in detail here. So if you like you can think of that particular embodiment in what follows, but this is more general than that. Any system meeting our five requirements might also work. In the D-Wave design, the step 4 equilibration algorithm is quantum annealing in the presence of a fixed physical temperature and a sparsely locally connected hardware graph, which seems to work very well in practice.

One specific idea for doing this

Let’s focus for a moment on the Vesuvius architecture. Here’s what it looks like for one of the chips in the lab. The grey circles are the qubits (think of them as neurons in this context) and the lines connecting them are the programmable pairwise connection strengths (think of them as connection strengths between neurons).

There are about 500 neurons in this graph. That’s not very many, but it’s enough to maybe do some interesting experiments. For example, the MNIST dataset is typically analyzed using 784 visible units, and a few thousand hidden units, so we’re not all that far off.

Here’s an idea of how this might work. In a typical DBM approach, there are multiple layers. Each individual layer has no connections within it, but adjacent layers are fully connected. Training proceeds by doing alternating Gibbs sampling between two groups of neurons that together form a bipartite graph: none of the even layer neurons are connected to each other, none of the odd layer neurons are connected to each other, but there is dense connectivity between the two groups. Within each group, the neurons are conditionally independent given the other group because of the bipartite structure.
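A minimal sketch of alternating Gibbs sampling on a bipartite network follows. It assumes 0/1 units and a sigmoid conditional, which is the usual software convention; the function and parameter names are hypothetical and this is not the hardware procedure.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def alternating_gibbs(W, bias_even, bias_odd, n_steps, rng):
    """Block Gibbs sampling on a bipartite network.
    W[i, j] couples even-group unit i to odd-group unit j."""
    even = rng.integers(0, 2, size=len(bias_even)).astype(float)
    odd = rng.integers(0, 2, size=len(bias_odd)).astype(float)
    for _ in range(n_steps):
        # Given the odd group, the even units are independent: sample them all at once
        p_even = sigmoid(bias_even + W @ odd)
        even = (rng.random(len(bias_even)) < p_even).astype(float)
        # Then do the same for the odd group given the new even group
        p_odd = sigmoid(bias_odd + W.T @ even)
        odd = (rng.random(len(bias_odd)) < p_odd).astype(float)
    return even, odd

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 4))
even, odd = alternating_gibbs(W, np.zeros(4), np.zeros(4), n_steps=100, rng=rng)
print(even, odd)
```

Each half-step updates a whole group in parallel, which is only valid because no two units in the same group are connected.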

We could try the following. Take all of the neurons in the above graph, and ‘stretch them out’ in a line. The vertices will then have the connections from the above graph. Here’s what that looks like for a smaller subgraph comprising a single unit cell.

On the left is the typical view of the Chimera lattice unit cell. On the right is the exact same graph but stretched out into a line.

If you do this with the entire Vesuvius graph, the resultant building block is a set of about 500 neurons with sparse intra-layer connectivity, with the same connectivity structure as the Vesuvius architecture.

If we assume that we can draw good Boltzmann-esque samples from this building block, we can tile out enough of them to do what we want using the following idea.

For this idea, we keep the basic structure of a DBM, alternating layers of neurons with full inter-layer connectivity, but instead of having no intra-layer connections, we introduce some that are in the Vesuvius graph. If we need more units than Vesuvius has qubits, we just accept that different Vesuvius blocks won’t have any inter-block lateral connections within layers (i.e. like a typical DBM).

To train this network, we do alternating Gibbs sampling as in a standard DBM, but using the probability distributions obtained by actually running the Vesuvius graph in hardware (biased suitably by the clamped variables) instead of the usual procedure.

Alright so let’s imagine we could equilibrate and draw samples from the above graph really quickly. What does this buy us?

Well the obvious thing is that you can now learn about possible intra-layer correlations. For example, in an image, we know that pixels have local correlations: pixels that are close to each other in an image will tend to be correlated. This type of correlation might be very useful for our model to be able to directly learn. This is the sort of thing that lateral connections within the visible layer might be useful for.

Another interesting possibility is that these intra-layer connections could represent the same input but at different times, the intuition being that inputs that are close in time are also likely to be correlated.

OK well why don’t you try it out?

That is a fabulous idea! I’m going to try this on MNIST and see if I can make it work. Stand by!

# Everything you always wanted to know about what it’s like to work here

We posted a new arxiv preprint today. It is called “Architectural considerations in the design of a superconducting quantum annealing processor”. You can download it here.

It describes how Vesuvius came to be. It is a great story — I think you will like it.

It is like a science fiction detective story outlining in a first hand experience kind of way what it’s like to be on the front lines of a brand new technology. I seriously couldn’t stop reading it once I started. If you’re interested in what it’s really like to work here on this type of stuff, you should read it.

# Six interesting findings from recent benchmarking results

Around May 15th of 2013 Google acquired a system built around a 509-qubit Vesuvius 6 (V6) chip. Since it went online, they have been running it 24/7 at 100% usage. Most of this time has been committed to benchmarking.

Some of these results have been published, and there has been some discussion of what it all means. Here I’d like to provide my own view of where I think we are, and what these results show.

Interesting finding #1: V6 is the first superconducting processor competitive with state of the art semiconducting processors.

Processors made out of superconductors have very interesting properties. The two that have historically driven interest are that they can be extremely fast, and they can operate without requiring lots of power. Interestingly they can even be run close to thermodynamical reversibility, with zero heat generation. There was a serious attempt to make superconducting processors work at IBM from 1969 to 1983; you can read a great first hand account of it here. Unfortunately the technology was not mature enough, semiconducting approaches were immensely profitable at the time, and the effort failed. Subsequently there has been much talk about doing something similar but with our new knowledge, but no-one has followed through.

It is difficult to find the amount of investment that has gone into superconducting processor R&D. As best I can count, the number is about \$4B. We account for about 3% of that number; IBM about 15%; and government sponsorship of basic research, primarily in Japan, US and Europe, the remainder. Depending on your perspective, this might sound like a lot, or like a very small number. For example, a single TSMC state of the art semiconductor fabrication facility costs about six times this (~\$25B) to build. The total investment in semiconductor fabrication facilities and equipment since the early days of Fairchild Semi is now approaching \$1T (yes, T as in Trillion). That doesn’t include any of the investment in actual processors, just the costs of building fabrication facilities.

The results that were recently published in the Ronnow et al. paper show that V6 is competitive with what’s arguably the most highly optimized semiconductor based solution possible today, even on a problem type that in hindsight was a bad choice. A fact that has not gotten as much coverage as it probably should is that V6 beats this competitor both in wallclock time and scaling for certain problem types. That is a truly astonishing achievement. Matthias Troyer and his team achieved an incredible level of optimization with their simulated annealing code, achieving 200 spin updates per nanosecond using a GPU based approach. The ‘out of the box’ unoptimized V6 system beats this approach for some problem types, and even for problem types where it doesn’t do so well (like the ones described in the Ronnow paper) it holds its own, and even wins in some cases.

This is a remarkable historic achievement. It’s the first delivery on the promise of superconducting processors.

Interesting finding #2: V6 is the first computing system using ideas from quantum information science competitive with the best classical computing systems.

Much like in the case of superconducting processors, the field of quantum computing has promised to provide new ways of doing things that are superior to the ways things are now. And much like superconducting processors, the actual delivery on that promise has been virtually non-existent.

The results of the recent studies show that V6 is the first computing system using ideas from quantum information science that is competitive with the best known classical algorithms run on the fastest modern processors available.

This is also a remarkable and historic achievement. It’s the first delivery on the promise of quantum computation.

Interesting finding #3: The problem type chosen for the benchmarking was wrong.

The type of problem that the Ronnow paper looked at — random spin glasses — made a lot of sense when the project began. Unfortunately about midway through the project it was discovered that this type of problem was expected theoretically to show no difference in scaling between simulated annealing (SA) and quantum annealing (QA). This analysis showed that it was necessary to add structure to the problem instances to see a scaling difference between the two. So if an analysis of the D-Wave approach has as its objective observing a scaling difference between SA and QA, random spin glass problems are the wrong choice.

Interesting finding #4: Google seems to love their machine.

Last week Google released a blog post about their benchmarking efforts that provides an overview of how they feel about what they’ve been seeing. Here are some key points they raise in that post.

• In an early test we dialed up random instances and pitted the machine against popular off-the-shelf solvers — Tabu Search, Akmaxsat and CPLEX. At 509 qubits, the machine is about 35,500 times (!) faster than the best of these solvers.

This is an important result. Beating a trillion dollars worth of investment with only the second generation of an entirely new computing paradigm by 35,500 times is a pretty damn awesome achievement. NOTE FOR EXPERTS: CPLEX was NOT run in these tests to global optimality. It was run in a mode where it was timed to the time it found a target solution, and not to the time it took to prove global optimality. In addition, Tabu Search is nearly always the best tool if you don’t know the structure of the QUBO problem you are solving. Beating it by this much is a Big Deal.

• For each classical solver, there are problems for which the hardware does much better.

This is extremely cool also. Even though we are now talking about the best solvers we know how to create, our Vesuvius chip, with about 0.01% of the investment of its competitor, is holding its own.

• A principal reason the portfolio solver is still competitive right now is actually rather mundane — the qubits in the current chip are still only sparsely connected.

This is really important to understand — making the D-Wave technology better is likely about making the problems being solved more rich by adding more couplers to the chip, which is just an engineering issue that is nearly completely decoupled from other things like the role of quantum mechanics in all of this. It is really straightforward to make this change.

• Eyeballing this treasure trove of data, we’re now trying to identify a class of problems for which the current quantum hardware might outperform all known classical solvers.

Now this is really cool. Even for Vesuvius there might be problems for which no known classical computer can compete!

Interesting finding #5: The system has been running 24/7 with not even a second of downtime for about six months.

This is also worth pointing out, as it’s quite a complex machine with the business end at or around 10 millikelvin. This aspect of the machine isn’t as sexy as some of the other issues typically discussed, but it’s evidence that the underlying engineering of the system is really pretty awesome.

Interesting finding #6: The technology has come a long way in a short period of time.

None of the above points were true last year. The discussion is now about whether we can beat any possible computer — even though it’s really only the second generation of an entirely new computing paradigm, built on a shoestring budget.

The next few generations of chip should push us way past this threshold — this is by far the most interesting time in the 15 year history of this project.

# Fusing sensor modalities using an LC-DBM

Our different senses differ in detail. The features that allow effective representation of audio signals are different than those for representing vision. More generally we’d expect that any particular sensor type we’d provide to a new Creature we’re designing would need some new and specific way to effectively represent the types of data it receives from the world, and these representations would be different from sensor to sensor.

The different actuators in our Creature should also get different representations. For example, the motor signals that drive the bulk movement of the Creature (say its wheels) probably have a very different character than those that drive fine motor skills (such as the movement of fingers).

In the DBM framework, there is a natural way to handle this. The approach is called a Locally Connected Deep Boltzmann Machine, or LC-DBM. We briefly encountered this concept in an earlier post, where it made an appearance in this figure (the picture on the right).

Different kinds of Boltzmann Machines. The visible units are grey, the hidden units are white.

Let’s see if we can build an interesting LC-DBM that we can run in hardware.

Embodying Cid in a robot

Imagine we have a robot that has two motors. One of these controls the movement of the back left wheel, and one controls the movement of the back right wheel. The robot will have a castor in front that’s free to rotate, so we get back wheel drive. We’ll assume for this first experiment that these motors only have two settings — off (0) and forward (1). This is not all that restrictive. While there are some things this type of Creature can’t do, he will be able to get to most places using just these movements.

We’ll give each of these motors visible units corresponding to two successive times. Now that we’re starting to think about embodiments, what these times are in actual physical units becomes important. In this case, we’ll set the interval between times to be much longer than the response time of the motors — say 450 ms.

The DBM we’ll start with to represent the two motors at two times will look the same as the one we used in experiment #3. Here it is.

For the motor sector, we use two visible units corresponding to whether a motor is off (0) or moving forward (1) at two successive times. t and t+1 differ by 450 ms.

This Creature is also going to be equipped with a camera, so we can also have vision neurons. A typical camera that you’d mount on a robot provides a huge amount of information, but what we’re going to do is to start off by only using a tiny fraction of it, and in a particularly dumb way. What we’ll do is take the images coming in from the camera, and separate them into two regions — the left and right halves of the full image. We’ll take all of the pixels in each side, average them, and threshold them such that if the average intensity of the pixels is 128 or higher, that means 1 (i.e. bright = 1) otherwise 0 (dark = 0). This mimics the thresholded photodetector ommatidia idea we discussed a couple of posts back, although now we have two of them — one for the left side of the creature’s vision, and one for the right side.
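Here is a minimal sketch of that thresholding step for a single frame. It assumes the camera delivers an 8-bit grayscale image as a 2D NumPy array; the function name `vision_bits` is made up for illustration.

```python
import numpy as np

def vision_bits(frame, threshold=128):
    """Return (left, right) bits for a 2D grayscale frame with values 0..255.
    A half is 1 (bright) if its mean intensity is at least the threshold."""
    mid = frame.shape[1] // 2
    left_mean = frame[:, :mid].mean()
    right_mean = frame[:, mid:].mean()
    return int(left_mean >= threshold), int(right_mean >= threshold)

# Toy frame: dark left half, bright right half
frame = np.zeros((4, 6), dtype=np.uint8)
frame[:, 3:] = 200
print(vision_bits(frame))  # (0, 1)
```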

Again we’ll have two successive times. Typical cameras provide around 30 frames per second, which is a lot faster than the time we set for the motor response. So what we’ll do is average the camera results over 15 frames (about 500 ms at 30 frames per second), so that the difference in time is roughly the same as the difference we chose for the motors. Again this is not the smartest thing we could do but we can improve this later! With these choices, here’s the DBM we will use for the vision system.

Note that each unit label has been shifted by 4 relative to the motor system.

Now let’s equip our Creature with a speaker / microphone. As with the vision system, an audio system we can mount on a robot can provide us with very rich data. But we’ll ignore most of it for the time being. Analogously to the simple system we put in place for vision, let’s again choose two audio neurons, but this time instead of thresholding the intensity of the visual input on the left/right halves of the incoming images, we’ll threshold the intensity of two different frequencies, one low and one high, corresponding to 100 Hz and 1000 Hz. An input in each will be 0 if the Fourier component of the signal over a total of 450 ms was less than a threshold, and 1 if it’s greater. The idea is that if these frequencies are present, the corresponding audio neuron will be on, otherwise it will be off.
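A minimal sketch of the audio thresholding, using NumPy's real FFT, could look like this. The sample rate, the amplitude normalization and the threshold value are all assumptions on my part; the post only fixes the two frequencies and the 450 ms window.

```python
import numpy as np

def audio_bits(signal, sample_rate, freqs=(100.0, 1000.0), threshold=0.5):
    """Return one bit per target frequency: 1 if the Fourier amplitude
    at the nearest frequency bin exceeds the threshold, else 0."""
    n = len(signal)
    # Scale so that a unit-amplitude sine at a bin frequency gives amplitude 1
    spectrum = np.abs(np.fft.rfft(signal)) * 2.0 / n
    bin_freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    bits = []
    for f in freqs:
        k = int(np.argmin(np.abs(bin_freqs - f)))
        bits.append(int(spectrum[k] > threshold))
    return tuple(bits)

sample_rate = 8000                        # assumed microphone sample rate in Hz
n_samples = round(0.45 * sample_rate)     # 450 ms window
t = np.arange(n_samples) / sample_rate
tone = np.sin(2 * np.pi * 100.0 * t)      # pure 100 Hz test tone
print(audio_bits(tone, sample_rate))      # (1, 0)
```

With a 450 ms window both 100 Hz and 1000 Hz fall exactly on FFT bins, so a clean tone at either frequency shows up in a single bin.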

Here’s the DBM for the audio system.

Here’s the audio DBM. We’ve set it up so that there are two audio neurons, one of which fires if it senses 100 Hz and the other which fires if it senses 1000 Hz.

Finally, let’s add a weapons system. We’ll mount a missile launcher on the robot. Because firing a missile is serious business, we’ll make it so that both weapons neurons have to be on simultaneously for a firing event, so 00, 01 and 10 mean ‘don’t fire’, and 11 means ‘fire’. Again we’ll have two times, separated by 450 ms. Here’s the weapons system DBM.

For the weapons system, we fire only if the most significant bit (MSB) and least significant bit (LSB) are both 1, else we don’t.

Connecting sensors and actuators at higher levels of the LC-DBM

OK so we have built four different DBMs for audio, vision, motor and weapons. But at the moment they are completely separate. Let’s fix that!

Here is an architecture that brings all four together, by combining the different modalities higher up in the LC-DBM.

An example of how we can connect up all four modalities. The orange level connects audio/weapons, audio/vision and motor/vision; the green level connects vision/audio/weapons and motor/vision/audio; and the top blue level connects all four.

This network can be embedded in hardware. I created this embedding by staring at it for a few minutes. There are probably much better ways to do it. But this one should work. Here it is!

This embedding uses a 4 by 6 block of unit cells, and should allow for all the connections in the network.

So that’s cool! Alright that’s enough for now, next time we’ll think about different experiments we can subject this New and Enhanced Cid to.

# First ever DBM trained using a quantum computer

In Terminator 2, Arnold reveals that his CPU is a neural net processor, a learning computer. Of course it is! What else would it be? Interestingly, there are real neural net processors in the world. D-Wave makes the only superconducting version, but there are other types out there also. Today we’ll use one of our superconducting neural nets to re-run the three experiments we did last time.

I believe this is the first time quantum hardware has been used to train a DBM, although there have been some theoretical investigations.

Embedding into hardware

Recall that the network we were training in the previous post had one visible layer with up to four units, and two hidden layers each with four units. For what follows we’re going to associate each of these units with a specific qubit in a Vesuvius processor. The way we’re going to do this is to use a total of 16 qubits in two unit cells to represent the 12 units in the DBM.

All D-Wave processors can be thought of as hardware neural nets, where the qubits are the neurons and the physical couplers between pairs of qubits are the edges between those neurons. Specifically you should think of them as a type of Deep Boltzmann Machine (DBM), where specifying the biases and weights in a DBM is exactly like specifying the biases and coupling strengths in a D-Wave processor. As in a DBM, what you get out are samples from a probability distribution, which are the (binary) states of the DBM’s units (both visible and hidden).

In the Vesuvius design, there is an 8×8 tile of eight-qubit unit cells, for a total of 512 ‘neurons’. Each neuron is connected to at most 6 other neurons in Vesuvius. To do the experiments we want to do, we only need two of the 64 unit cells. For the experts out there, we could use the rest to do some interesting tricks to use more of the chip, such as gauge transformations and simple classical parallelism, but for now we’ll just stick to the most basic implementation.
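To make the numbers concrete, here is a small sketch that builds the connectivity graph of an idealized chip and counts qubits and couplers. It assumes every qubit works, that each unit cell is a complete 4-by-4 bipartite block, and that vertical qubits couple to the cells above and below while horizontal qubits couple to the cells left and right. The function name `chimera_edges` and the qubit labels are made up for this example.

```python
from collections import defaultdict


def chimera_edges(rows=8, cols=8, shore=4):
    """Couplers of a rows x cols grid of 2*shore-qubit unit cells.

    A qubit is labeled (row, col, side, k), where side 0 means 'vertical'
    and side 1 means 'horizontal' (an assumed labeling).
    """
    edges = set()
    for r in range(rows):
        for c in range(cols):
            # Inside a cell, every vertical qubit couples to every horizontal one.
            for i in range(shore):
                for j in range(shore):
                    edges.add(((r, c, 0, i), (r, c, 1, j)))
            for k in range(shore):
                # Vertical qubits couple to their partner in the cell below.
                if r + 1 < rows:
                    edges.add(((r, c, 0, k), (r + 1, c, 0, k)))
                # Horizontal qubits couple to their partner in the cell to the right.
                if c + 1 < cols:
                    edges.add(((r, c, 1, k), (r, c + 1, 1, k)))
    return edges


edges = chimera_edges()
degree = defaultdict(int)
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

print(len(degree), "qubits")                       # 512
print(len(edges), "couplers")                      # 1472
print(max(degree.values()), "max couplers per qubit")  # 6
```

A qubit in the interior of the chip has four couplers inside its own cell plus two to neighboring cells, which is where the limit of 6 comes from.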

Here is a presentation containing some information about Vesuvius and its design. Take a look at slides 11-17 to get a high level overview of what’s going on.

Here is a picture of the DBM we set up in the last post.

Here we still have two neurons — one vision and one motor — but we have two different times (here labeled t and t+1).

Here is the embedding into hardware we’ll use. Hopefully this is clear! Each of the blue lines is a qubit. The horizontal qubits in unit cell #1 are strongly coupled to the horizontal qubits in unit cell #2 (represented by the red circles). We do this so that the variables in the first hidden layer can talk to all four variables in the second hidden layer (these are the four vertical qubits in unit cell #1) and all four visible units (these are the vertical qubits in unit cell #2).

The embedding into hardware we’ll use here. We use two unit cells from the top left hand corner of the chip. The red circles indicate strong ferromagnetic coupling between the horizontal qubits in the two unit cells, which represent the four variables in the first hidden layer. The leftmost four vertical qubits represent the variables in the second hidden layer, while the rightmost four qubits represent the visible units.
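The embedding can also be written down as a lookup table and checked mechanically. The sketch below assumes the two unit cells sit side by side and uses made-up labels of the form `(cell, orientation, index)`; it confirms that every connection the DBM needs between adjacent layers has a physical coupler behind it.

```python
# Physical couplers available in two side-by-side unit cells (assumed layout).
couplers = set()
for cell in (1, 2):
    for v in range(4):
        for h in range(4):
            couplers.add(frozenset({(cell, "V", v), (cell, "H", h)}))
for k in range(4):
    # The strong ferromagnetic links (the red circles in the figure).
    couplers.add(frozenset({(1, "H", k), (2, "H", k)}))

# Logical DBM unit -> the physical qubits that represent it.
embedding = {}
for k in range(4):
    embedding[("visible", k)] = [(2, "V", k)]
    embedding[("hidden2", k)] = [(1, "V", k)]
    embedding[("hidden1", k)] = [(1, "H", k), (2, "H", k)]  # chained pair


def connected(u, v):
    """True if some qubit of unit u shares a coupler with some qubit of unit v."""
    return any(frozenset({a, b}) in couplers
               for a in embedding[u] for b in embedding[v])


# The first hidden layer must reach every unit in both neighboring layers.
logical_edges = [(("hidden1", i), (layer, j))
                 for layer in ("visible", "hidden2")
                 for i in range(4) for j in range(4)]

n_qubits = sum(len(qubits) for qubits in embedding.values())
print(n_qubits, "qubits for", len(embedding), "units")   # 16 qubits for 12 units
print(all(connected(u, v) for u, v in logical_edges))    # True
```

The extra four qubits are the price of the chained pairs: each first-hidden-layer variable needs one horizontal qubit in each cell so that it can reach both sets of vertical qubits.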

Using the chip to replace the alternating Gibbs sampling step

Recall that the algorithm we used for training the DBM required drawing samples from two different distributions — the ‘freely running’ network, and a network with inputs clamped to values set by the data we are learning over. So now we have a hardware neural net. Can we do these two things directly?

The way the chip works is that we first program in a set of biases and weights, and then draw a bunch of samples from the probability distribution they create. So we should be able to do this by following a very simple prescription — do everything exactly the same as before, except replace the alternating Gibbs sampler with samples drawn from the hardware with its machine language parameters set to the current bias, offset and weight values.
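Here is a rough sketch of what "replace the sampler" means in code. The learning step takes the sampler as an argument, so swapping software for hardware only changes what gets passed in. Everything here is an assumption made for illustration: the energy convention, the function names, and especially `software_sampler`, which is a brute-force stand-in and not a real hardware call.

```python
import itertools
import math
import random


def energy(x, biases, weights):
    """E(x) = -sum_i b_i x_i - sum_(i,j) w_ij x_i x_j for binary x (assumed form)."""
    e = -sum(b * xi for b, xi in zip(biases, x))
    for (i, j), w in weights.items():
        e -= w * x[i] * x[j]
    return e


def software_sampler(biases, weights, n_samples, rng, clamp=None):
    """Hypothetical stand-in for the chip: exact Boltzmann samples by
    enumeration (only feasible for a handful of units).
    `clamp` maps unit index -> fixed value for the clamped phase."""
    clamp = clamp or {}
    states = [x for x in itertools.product((0, 1), repeat=len(biases))
              if all(x[i] == v for i, v in clamp.items())]
    probs = [math.exp(-energy(x, biases, weights)) for x in states]
    return rng.choices(states, weights=probs, k=n_samples)


def mean_stats(samples, weights):
    """Average unit values and average pairwise products over the samples."""
    n = len(samples)
    first = [sum(x[i] for x in samples) / n for i in range(len(samples[0]))]
    second = {(i, j): sum(x[i] * x[j] for x in samples) / n for (i, j) in weights}
    return first, second


def learning_step(biases, weights, data, sampler, rng, lr=0.005, n_samples=100):
    """One update: clamped-network statistics minus freely-running statistics."""
    clamped = []
    for visible in data:  # each item maps visible unit index -> observed value
        clamped += sampler(biases, weights, 1, rng, clamp=visible)
    free = sampler(biases, weights, n_samples, rng)
    c1, c2 = mean_stats(clamped, weights)
    f1, f2 = mean_stats(free, weights)
    new_biases = [b + lr * (c - f) for b, c, f in zip(biases, c1, f1)]
    new_weights = {k: w + lr * (c2[k] - f2[k]) for k, w in weights.items()}
    return new_biases, new_weights


rng = random.Random(0)
data = [{0: 1}] * 10  # toy data: visible unit 0 is always on
biases, weights = learning_step([0.0, 0.0, 0.0], {(0, 1): 0.0, (1, 2): 0.0},
                                data, software_sampler, rng)
print(biases, weights)
```

In the hardware version, `software_sampler` would be replaced by a function with the same arguments that programs the chip and reads back samples. The rest of the loop stays as it is.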

The only tricky part of this (and it’s not really all that tricky) is to create the map from the biases, weights and offsets in the software model to the biases and weights in the hardware.
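As a sketch of one piece of that map, here is the standard change of variables from 0/1 units to the ±1 spins the hardware uses, via x = (1 + s)/2. It assumes the software energy has the form E(x) = -Σ b_i x_i - Σ w_ij x_i x_j with any offsets already folded into the biases, and that the hardware energy is Σ h_i s_i + Σ J_ij s_i s_j. The name `to_ising` is made up. Rescaling into the hardware’s allowed range and splitting biases across chained qubits are not handled here.

```python
import itertools


def to_ising(biases, weights):
    """Map a 0/1 model (b, w) to a ±1 model (h, J) plus a constant energy shift."""
    h = [-b / 2 for b in biases]
    J = {}
    const = -sum(biases) / 2
    for (i, j), w in weights.items():
        J[(i, j)] = -w / 4
        h[i] -= w / 4
        h[j] -= w / 4
        const -= w / 4
    return h, J, const


def binary_energy(x, biases, weights):
    """Assumed software energy for x in {0, 1}^n."""
    return (-sum(b * xi for b, xi in zip(biases, x))
            - sum(w * x[i] * x[j] for (i, j), w in weights.items()))


def ising_energy(s, h, J):
    """Hardware-style energy for s in {-1, +1}^n."""
    return (sum(hi * si for hi, si in zip(h, s))
            + sum(Jij * s[i] * s[j] for (i, j), Jij in J.items()))


biases = [0.3, -0.2, 0.5]
weights = {(0, 1): 1.0, (1, 2): -0.7}
h, J, const = to_ising(biases, weights)

# Check that both descriptions assign the same energy to every state.
max_gap = max(
    abs(binary_energy(x, biases, weights)
        - (ising_energy(tuple(2 * xi - 1 for xi in x), h, J) + const))
    for x in itertools.product((0, 1), repeat=len(biases))
)
print("energies agree:", max_gap < 1e-12)
```

Because the two energies differ only by a constant, they define the same probability distribution over states, which is what matters for sampling.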

Experimental results: Running a real quantum brain

Here are the results of doing this for the three experiments we set up last time, but now comparing training the DBM using alternating Gibbs sampling in software to training the DBM by drawing samples from a Vesuvius 6 chip. The parameters of the run were 100 problems per minibatch, 100 epochs, 1000 learning steps per epoch, learning rate = 0.005 and reparametrization rate = 0 (I set it to zero just to make everything simpler for debugging — we could make it non-zero if we want).

Comparing Alternating Gibbs Sampling in software (blue) to drawing samples from Vesuvius (red). Both do great!

Same comparison, but for Experiment #2. Here we see something very interesting — the quantum version learns faster and gets a lot smarter!

Same comparison, but for Experiment #3. Again the quantum version learns faster and gets smarter.

This is just so freaking cool.

A recap

So for the first time ever, a quantum computer has been used to train a DBM. We did this for three different experiments, and plotted the $S_0$ number as a function of epoch for 100 epochs. We compared the results of the DBM training on a Vesuvius chip to the same results using the standard alternating Gibbs sampling approach, and found that for experiments 2 and 3 the quantum version trained faster and obtained better scores.

This better performance is due to replacing the approximate alternating Gibbs sampling (AGS) step with the correct sampling from the full probability distribution, obtained using Vesuvius.