Training DBMs with physical neural nets

There are a lot of physical neural nets on planet Earth. Just the humans alone account for about 7.139 billion of them. You have one, hidden in close to perfect darkness inside your skull — a complex graph with about 100 billion neurons and 0.15 quadrillion connections between those neurons.

Of course we’d like to be able to build machines that do what that lump of squishy pink-gray goo does. Mostly because it’s really hard and therefore fun. But also because having an army of sentient robots would be super sweet. And it seems sad that all matter can’t be made aware of its own mortality and suffer the resultant existential angst. Stupid 5 billion year old rocks. See how smug you are when you learn about the heat death of the universe.

Biological inspiration

One thing that is also hard, but not that hard, is trying to build different kinds of physical neural nets that are somewhat inspired by our brains. ‘Somewhat inspired’ is a little vague. We don’t actually understand a lot about how brains actually work. But we know a bit. In some cases, such as our visual perception system, we know quite a bit. This knowledge has really helped the algorithmic side of building better and better learning systems.

So let’s explore engineering our own non-biological but biologically inspired physical neural nets. Does this idea make sense? How would we use such things?

Training a Deep Boltzmann Machine

One kind of neural net that’s quite interesting is a Deep Boltzmann Machine (DBM). Recall that a DBM can be thought of as a graph comprising both visible and hidden units. The visible units act as an interface layer between the external universe that the DBM is learning from, and the hidden units which are used to build an internal representation of the DBM’s universe.

A method for training a DBM was demonstrated in this paper. As we discussed earlier, the core mathematical problem for training a DBM is sampling from two different distributions — one where the visible units are clamped to data (the Creature is ‘looking at the world’), and one where the entire network is allowed to run freely (the Creature is ‘dreaming about the world’). In the general case, this is hard to do because the distributions we need to sample from are Boltzmann distributions over all the unclamped nodes of the network. In practice, the connectivity of the graph is restricted and approximate techniques are used to perform the sampling. These ideas allow very large networks to be trained, but this comes with a potentially serious loss of modeling efficiency.

Using physical hardware to perform the sampling steps

Because the sampling steps are a key bottleneck for training DBMs, maybe we could think of a better way to do it. What if we built an actual physical neural net? Could we design something that could do this task better than the software approaches typically used?

Here’s the necessary ingredients:

  1. A two-state device that would play the part of the neurons
  2. The ability to locally programmatically bias each neuron to preferentially be in either of their states
  3. Communications channels between pairs of neurons, where the relative preference of the pair could be set programmatically
  4. The ability of the system to reach thermal equilibrium with its environment at a temperature with energy scale comparable to the energy scales of the individual neurons
  5. The ability to read out each neuron’s state with high fidelity

If you had these ingredients, you could place the neurons where you wanted them for your network; connect them like you want for your network; program in their local biases and connection weights; allow them to reach thermal equilibrium (i.e. reach a Boltzmann distribution); and then sample by measuring their states.

The key issue here is step 4. The real question, which is difficult to answer without actually building whatever you have in mind, is whether or not whatever the distribution you get in hardware is effective for learning or not. It might not be Boltzmann, because the general case takes exponential time to thermally equilibrate. However the devil is in the details here. The distribution sampled from when alternating Gibbs sampling is done is also not Boltzmann, but it works pretty well. A physical system might be equilibrated well enough by being smart about helping it equilibrate, using sparsely connected graphs, principles like thermal and / or quantum annealing, or other condensed matter physics / statistical mechanics inspired tricks.

The D-Wave architecture satisfies all five of these requirements. You can read about it in detail here. So if you like you can think of that particular embodiment in what follows, but this is more general than that. Any system meeting our five requirements might also work. In the D-Wave design, the step 4 equilibration algorithm is quantum annealing in the presence of a fixed physical temperature and a sparsely locally connected hardware graph, which seems to work very well in practice.

One specific idea for doing this

Let’s focus for a moment on the Vesuvius architecture. Here’s what it looks like for one of the chips in the lab. The grey circles are the qubits (think of them as neurons in this context) and the lines connecting them are the programmable pairwise connection strengths (think of them as connection strengths between neurons).

vesuvius_connectivityThere are about 500 neurons in this graph. That’s not very many, but it’s enough to maybe do some interesting experiments. For example, the MNIST dataset is typically analyzed using 784 visible units, and a few thousand hidden units, so we’re not all that far off.

Here’s an idea of how this might work. In a typical DBM approach, there are multiple layers. Each individual layers has no connections within it, but adjacent layers are fully connected. Training proceeds by doing alternating Gibbs sampling between two sets of bipartite neurons — none of the even layer neurons are connected, none of the odd layer neurons are connected, but there is dense connectivity between the two groups. The two groups are conditionally independent because of the bipartite structure.

We could try the following. Take all of the neurons in the above graph, and ‘stretch them out’ in a line. The vertices will then have the connections from the above graph. Here’s the idea for a smaller subgraph comprising a single unit cell so you can get the idea.

On the left is the typical view of the Chimera lattice unit cell. On the right is the exact same graph but stretched out into a line.

On the left is the typical view of the Chimera lattice unit cell. On the right is the exact same graph but stretched out into a line.

If you do this with the entire Vesuvius graph, the resultant building block is a set of about 500 neurons with sparse inter-layer connectivity with the same connectivity structure as the Vesuvius architecture.

If we assume that we can draw good Boltzmann-esque samples from this building block, we can tile out enough of them to do what we want using the following idea.

For this idea, we keep the basic structure of a DBM -- alternating layers of fully connected neurons -- but instead of having no inter-layer connections, we introduce some that are in the Vesuvius graph. If we need more units than Vesuvius has qubits, we just accept that different Vesuvius blocks won't have any lateral connections within layers (i.e. like a typical DBM).

For this idea, we keep the basic structure of a DBM — alternating layers of neurons with full intra-layer connectivity — but instead of having no inter-layer connections, we introduce some that are in the Vesuvius graph. If we need more units than Vesuvius has qubits, we just accept that different Vesuvius blocks won’t have any inter-block lateral connections within layers (i.e. like a typical DBM).

To train this network, we do alternating Gibbs sampling as in a standard DBM, but using the probability distributions obtained by actually running the Vesuvius graph in hardware (biased suitably by the clamped variables) instead of the usual procedure.

What might this buy us?

Alright so let’s imagine we could equilibrate and draw samples from the above graph really quickly. What does this buy us?

Well the obvious thing is that you can now learn about possible inter-layer correlations. For example, in an image, we know that pixels have local correlations — pixels that are close to each other in an image will tend to be correlated. This type of correlation might be very useful for our model to be able to directly learn. This is the sort of thing that inter-layer correlations within the visible layer might be useful for.

Another interesting possibility is that these inter-layer connections could represent the same input but at different times, the intuition being that inputs that are close in time are also likely to be correlated.

OK well why don’t you try it out?

That is a fabulous idea! I’m going to try this on MNIST and see if I can make it work. Stand by!

Everything you always wanted to know about what it’s like to work here

We posted a new arxiv preprint today. It is called “Architectural considerations in the design of a superconducting quantum annealing processor”. You can download it here.

It describes how Vesuvius came to be. It is a great story — I think you will like it.

It is like a science fiction detective story outlining in a first hand experience kind of way what it’s like to be on the front lines of a brand new technology. I seriously couldn’t stop reading it once I started. If you’re interested in what it’s really like to work here on this type of stuff, you should read it.

“Inside the chip” – new video showing Rainier 128 processor

Here is a video showing how some of the parts of a D-Wave Rainier processor go together to create the fabric of the quantum computer.

The animation shows how the processor is made up of 128 qubits, 352 couplers and nearly 24,000 Josephson junctions. The qubits are arranged in a tiling pattern to allow them to connect to one another.


Vesuvius: A closer look – 512 qubit processor gallery

The next generation of D-Wave’s technology is called Vesuvius, and it’s going to be a very interesting processor. The testing and development of this new generation of quantum processor is going well. In the meantime, here are some beautiful images of Vesuvius!

quantum computer quantum computing D-Wave Systems Vesuvius

Above: An entire wafer of Vesuvius processors after the full fabrication process has completed.


quantum computer quantum computing D-Wave Systems Vesuvius

Above: Photographing the wafer from a different angle allows more of the structure to be seen. Exercise for the reader: Estimate the number of qubits in this image :)


quantum computer quantum computing D-Wave Systems Vesuvius

Above: A slightly closer view of part of the wafer. The small scale of the structures (<1um) produces a diffraction grating effect (like you see on the underside of a CD) resulting in a beautiful spectrum of colours reflecting from the wafer surface.


quantum computer quantum computing D-Wave Systems Vesuvius

Above: A different angle of shot produces different colours and allows different areas of the circuitry to become visible.


quantum computer quantum computing D-Wave Systems Vesuvius

Above: A close-up image of a single Vesuvius processor on the wafer. The white square seen to the right of the image contains the main ‘fabric’ of 512 connected qubits.


quantum computer quantum computing D-Wave Systems Vesuvius

Above: An image of a processor wire-bonded to the chip carrier, ready to be installed into the computer system. The wires carry the signals to the quantum components and associated circuitry on the chip.


quantum computer quantum computing D-Wave Systems Vesuvius

Above: A larger view of the bonded Vesuvius processor. More of the chip packaging is now also visible in the image.


quantum computer quantum computing D-Wave Systems Vesuvius

Above: The full chip packaging is visible, complete with wafer.


Implementation of a Quantum Annealing Algorithm Using a Superconducting Circuit

Implementation of a Quantum Annealing Algorithm Using a Superconducting Circuit

A circuit consisting of a network of coupled compound Josephson junction rf-SQUID flux qubits has been used to implement an adiabatic quantum optimization algorithm. It is shown that detailed knowledge of the magnitude of the persistent current as a function of annealing parameters is key to implementation of the algorithm on this particular type of hardware. Experimental results contrasting two annealing protocols, one with and one without active compensation for the growth of the qubit persistent current during annealing, are presented in order to illustrate this point.


Multi-qubit synchronization results

Synchronization of Multiple Coupled rf-SQUID Flux Qubits

A practical strategy for synchronizing the properties of compound Josephson junction rf-SQUID qubits on a multiqubit chip has been demonstrated. The impacts of small (~ 1 %) fabrication variations in qubit inductance and critical current can be minimized by the application of a custom tuned flux offset to the CJJ structure of each qubit. This strategy allows for simultaneous synchronization of the qubit persistent current and tunnel splitting over a range of external bias parameters that is relevant for the implementation of an adiabatic quantum processor.