Training DBMs with physical neural nets

There are a lot of physical neural nets on planet Earth. Just the humans alone account for about 7.139 billion of them. You have one, hidden in close to perfect darkness inside your skull — a complex graph with about 100 billion neurons and 0.15 quadrillion connections between those neurons.

Of course we’d like to be able to build machines that do what that lump of squishy pink-gray goo does. Mostly because it’s really hard and therefore fun. But also because having an army of sentient robots would be super sweet. And it seems sad that all matter can’t be made aware of its own mortality and suffer the resultant existential angst. Stupid 5 billion year old rocks. See how smug you are when you learn about the heat death of the universe.

Biological inspiration

One thing that is also hard, but not that hard, is trying to build different kinds of physical neural nets that are somewhat inspired by our brains. ‘Somewhat inspired’ is a little vague. We don’t understand a lot about how brains actually work. But we know a bit. In some cases, such as our visual perception system, we know quite a bit. This knowledge has really helped the algorithmic side of building better and better learning systems.

So let’s explore engineering our own non-biological but biologically inspired physical neural nets. Does this idea make sense? How would we use such things?

Training a Deep Boltzmann Machine

One kind of neural net that’s quite interesting is a Deep Boltzmann Machine (DBM). Recall that a DBM can be thought of as a graph comprising both visible and hidden units. The visible units act as an interface layer between the external universe that the DBM is learning from, and the hidden units which are used to build an internal representation of the DBM’s universe.

A method for training a DBM was demonstrated in this paper. As we discussed earlier, the core mathematical problem for training a DBM is sampling from two different distributions — one where the visible units are clamped to data (the Creature is ‘looking at the world’), and one where the entire network is allowed to run freely (the Creature is ‘dreaming about the world’). In the general case, this is hard to do because the distributions we need to sample from are Boltzmann distributions over all the unclamped nodes of the network. In practice, the connectivity of the graph is restricted and approximate techniques are used to perform the sampling. These ideas allow very large networks to be trained, but this comes with a potentially serious loss of modeling efficiency.
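To make those two phases concrete, here is a minimal numpy sketch for a single visible/hidden layer pair. The toy sizes, the short free-running chain (a crude stand-in for full equilibration), and the learning rate are all illustrative choices of mine, not the procedure from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy sizes: 6 visible units, 4 hidden units.
n_vis, n_hid = 6, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))

def sample_hidden(v):
    """Sample hidden units given the (possibly clamped) visible units."""
    p = sigmoid(v @ W)
    return (rng.random(p.shape) < p).astype(float)

# 'Looking at the world': visibles clamped to a batch of data.
v_data = (rng.random((10, n_vis)) < 0.5).astype(float)
h_data = sample_hidden(v_data)

# 'Dreaming about the world': let the network run freely for a few steps.
v_model = v_data.copy()
for _ in range(5):
    h_model = sample_hidden(v_model)
    pv = sigmoid(h_model @ W.T)
    v_model = (rng.random(pv.shape) < pv).astype(float)

# The learning signal is the difference of clamped and free correlations.
lr = 0.05
W += lr * (v_data.T @ h_data - v_model.T @ h_model) / len(v_data)
```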

Using physical hardware to perform the sampling steps

Because the sampling steps are a key bottleneck for training DBMs, maybe we could think of a better way to do it. What if we built an actual physical neural net? Could we design something that could do this task better than the software approaches typically used?

Here are the necessary ingredients:

  1. A two-state device that would play the part of the neurons
  2. The ability to locally and programmatically bias each neuron to preferentially occupy either of its two states
  3. Communications channels between pairs of neurons, where the relative preference of the pair could be set programmatically
  4. The ability of the system to reach thermal equilibrium with its environment at a temperature with energy scale comparable to the energy scales of the individual neurons
  5. The ability to read out each neuron’s state with high fidelity

If you had these ingredients, you could place the neurons where you wanted them for your network; connect them like you want for your network; program in their local biases and connection weights; allow them to reach thermal equilibrium (i.e. reach a Boltzmann distribution); and then sample by measuring their states.
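Here is what that recipe looks like end to end, with a software Metropolis loop standing in for ingredient 4 (in real hardware, the physics would do this part for free). The sizes, couplings, and sweep count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Ingredients 1-3: binary 'neurons', programmable biases h, programmable couplings J.
n = 8
h = rng.uniform(-1, 1, n)                    # local bias on each neuron
J = np.triu(rng.uniform(-1, 1, (n, n)), 1)   # pairwise connection weights

def energy(s):
    return h @ s + s @ J @ s

# Ingredient 4: equilibrate to a Boltzmann distribution at inverse temperature beta.
def sample(beta=1.0, sweeps=200):
    s = rng.choice([-1.0, 1.0], n)
    for _ in range(sweeps):
        for i in range(n):
            flipped = s.copy()
            flipped[i] *= -1
            dE = energy(flipped) - energy(s)
            if dE < 0 or rng.random() < np.exp(-beta * dE):
                s = flipped
    return s  # ingredient 5: read out every neuron's state

samples = np.array([sample() for _ in range(100)])
```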

The key issue here is step 4. The real question, which is difficult to answer without actually building whatever you have in mind, is whether the distribution you get in hardware is effective for learning. It might not be Boltzmann, because in the general case thermal equilibration takes exponential time. However, the devil is in the details: the distribution that alternating Gibbs sampling draws from is not exactly Boltzmann either, and yet it works pretty well. A physical system might be equilibrated well enough by being smart about helping it along, using sparsely connected graphs, thermal and/or quantum annealing, or other tricks inspired by condensed matter physics and statistical mechanics.
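One way to probe ‘well enough’: for a tiny graph you can enumerate the exact Boltzmann distribution and measure how far the empirical distribution of your sampler (hardware or software) is from it. A sketch, with sizes and couplings chosen arbitrarily:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 4
h = rng.uniform(-1, 1, n)
J = np.triu(rng.uniform(-1, 1, (n, n)), 1)

def energy(s):
    return h @ s + s @ J @ s

# Exact Boltzmann probabilities -- only feasible for tiny n.
states = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
p_exact = np.exp(-np.array([energy(s) for s in states]))
p_exact /= p_exact.sum()

def empirical(samples):
    """Histogram of +/-1 sample rows over all 2^n states."""
    idx = (((samples + 1) / 2) @ (2 ** np.arange(n)[::-1])).astype(int)
    return np.bincount(idx, minlength=2 ** n) / len(samples)

def kl(p_emp):
    mask = p_emp > 0
    return float(np.sum(p_emp[mask] * np.log(p_emp[mask] / p_exact[mask])))

# Perfect samples should give a KL divergence near zero:
draws = states[rng.choice(len(states), size=5000, p=p_exact)]
print(kl(empirical(draws)))
```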

The D-Wave architecture satisfies all five of these requirements. You can read about it in detail here. So if you like, you can think of that particular embodiment in what follows, but the idea is more general: any system meeting our five requirements could also work. In the D-Wave design, the step 4 equilibration algorithm is quantum annealing in the presence of a fixed physical temperature and a sparsely locally connected hardware graph, which seems to work very well in practice.

One specific idea for doing this

Let’s focus for a moment on the Vesuvius architecture. Here’s what it looks like for one of the chips in the lab. The grey circles are the qubits (think of them as neurons in this context) and the lines connecting them are the programmable pairwise connection strengths (think of them as connection strengths between neurons).

There are about 500 neurons in this graph. That’s not very many, but it’s enough to maybe do some interesting experiments. For example, the MNIST dataset is typically analyzed using 784 visible units and a few thousand hidden units, so we’re not all that far off.

Here’s an idea of how this might work. In a typical DBM approach, there are multiple layers. Each individual layer has no connections within it, but adjacent layers are fully connected. Training proceeds by alternating Gibbs sampling between two groups of layers: none of the even-layer neurons are connected to each other, none of the odd-layer neurons are connected to each other, but there is dense connectivity between the two groups. The two groups are conditionally independent because of this bipartite structure.
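A minimal sketch of that alternating scheme for a toy three-layer network (layer sizes and weights are arbitrary; biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy DBM: three layers, weights only between adjacent layers.
sizes = [6, 5, 4]
Ws = [0.1 * rng.standard_normal((sizes[i], sizes[i + 1])) for i in range(2)]
layers = [(rng.random(s) < 0.5).astype(float) for s in sizes]

def gibbs_update(parity):
    """Resample every layer of one parity; given the other parity's layers,
    these layers are conditionally independent (the bipartite structure)."""
    for i in range(parity, len(layers), 2):
        field = np.zeros(sizes[i])
        if i > 0:
            field += layers[i - 1] @ Ws[i - 1]   # input from the layer below
        if i < len(layers) - 1:
            field += Ws[i] @ layers[i + 1]       # input from the layer above
        layers[i] = (rng.random(field.shape) < sigmoid(field)).astype(float)

for _ in range(10):      # alternate: odd layers, then even layers
    gibbs_update(1)
    gibbs_update(0)
```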

We could try the following. Take all of the neurons in the above graph and ‘stretch them out’ in a line. The vertices keep the connections they had in the original graph. Here’s the procedure for a smaller subgraph comprising a single unit cell, so you can see how it works.

On the left is the typical view of the Chimera lattice unit cell. On the right is the exact same graph but stretched out into a line.


If you do this with the entire Vesuvius graph, the resultant building block is a set of about 500 neurons in a single layer, with sparse intra-layer connectivity inherited from the Vesuvius architecture.
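To see what ‘stretching out’ does to the connectivity, here is the single unit cell as an edge list. The interleaved ordering is an arbitrary choice of mine; the point is that the 16 bipartite edges become sparse lateral connections among 8 units in a line (16 of the 28 possible pairs):

```python
# The Chimera unit cell is a complete bipartite K_{4,4}: qubits 0-3 on one
# side, qubits 4-7 on the other, every cross pair coupled.
cell_edges = [(i, j) for i in range(4) for j in range(4, 8)]

# 'Stretching out' = reading the same 8 vertices as one line of neurons.
line_order = [0, 4, 1, 5, 2, 6, 3, 7]        # arbitrary interleaving
position = {q: p for p, q in enumerate(line_order)}
stretched = sorted((min(position[a], position[b]), max(position[a], position[b]))
                   for a, b in cell_edges)
print(stretched)   # sparse intra-layer edges, not all-to-all
```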

If we assume that we can draw good Boltzmann-esque samples from this building block, we can tile out enough of them to do what we want using the following idea.


For this idea, we keep the basic structure of a DBM (alternating layers, with full connectivity between adjacent layers) but instead of having no intra-layer connections, we introduce some: the ones present in the Vesuvius graph. If we need more units than Vesuvius has qubits, we just accept that different Vesuvius blocks won’t have any lateral connections between them within a layer (i.e. just like a typical DBM).

To train this network, we do alternating Gibbs sampling as in a standard DBM, but using the probability distributions obtained by actually running the Vesuvius graph in hardware (suitably biased by the clamped variables) instead of the usual software procedure.
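In code, the only piece that changes relative to conventional DBM training is where the samples come from. A heavily hedged sketch: `hardware_samples` is a hypothetical stand-in that I have faked with coin flips so the snippet runs (it is not a real API), and clamping-as-shifted-biases is a crude simplification of conditioning:

```python
import numpy as np

rng = np.random.default_rng(4)

def hardware_samples(biases, couplings, num_reads=100):
    """HYPOTHETICAL hardware call: program the biases and couplings, let the
    chip equilibrate, read out all units num_reads times. Faked here."""
    return rng.choice([0.0, 1.0], size=(num_reads, len(biases)))

def train_step(W, b_vis, b_hid, v_data, lr=0.01):
    # Positive phase: clamping visibles to data shifts the effective biases
    # felt by the hidden units; sample the hiddens in 'hardware'.
    h_pos = hardware_samples(b_hid + v_data.mean(axis=0) @ W, None)
    # Negative phase: nothing clamped; the whole network runs freely.
    free = hardware_samples(np.concatenate([b_vis, b_hid]), None)
    v_neg, h_neg = free[:, :len(b_vis)], free[:, len(b_vis):]
    # Identical update rule to software training -- only the sampler differs.
    grad = v_data.T @ h_pos[:len(v_data)] / len(v_data) - v_neg.T @ h_neg / len(v_neg)
    return W + lr * grad

# Toy usage:
n_vis, n_hid = 6, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))
v = (rng.random((10, n_vis)) < 0.5).astype(float)
W = train_step(W, np.zeros(n_vis), np.zeros(n_hid), v)
```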

What might this buy us?

Alright so let’s imagine we could equilibrate and draw samples from the above graph really quickly. What does this buy us?

Well, the obvious thing is that you can now learn about possible intra-layer correlations. For example, in an image we know that pixels have local correlations: pixels that are close to each other in an image will tend to be correlated. This is exactly the type of correlation that lateral connections within the visible layer might let the model learn directly.

Another interesting possibility is that these lateral connections could represent the same input at different times, the intuition being that inputs that are close in time are also likely to be correlated.

OK well why don’t you try it out?

That is a fabulous idea! I’m going to try this on MNIST and see if I can make it work. Stand by!

23 thoughts on “Training DBMs with physical neural nets”

  1. Very interesting read. As we create more complicated neural nets do you think a human-level AI will require a quantum computer with as many qubits as there are neurons in the human brain? Also may I suggest an excellent book I read a while back. It’s called On Intelligence by Jeff Hawkins, the founder of Palm Computing. You guys should get together 🙂

    • Hi George! Hawkins’ book was a major inspiration to me, and I think to many of the people in this field. It is a classic.

      As to the role of specialized hardware in AI — the question isn’t so much whether you need it to actually run the algorithms underlying human-level sentience once we know what they are (I don’t think you would). But actually finding those algorithms in the first place is going to be done through high-throughput empirically driven experimentation, and in order to do that, you need to be able to do A LOT of experiments FAST.

      So finding the algorithms, and the hyperparameters for getting them working well, is going to consume many orders of magnitude more compute effort than actually running them once we have them. That’s where special-purpose hardware might play a major role, and there is no technological reason why we couldn’t build superconducting neural nets with 100 billion qubits and 0.15 quadrillion connections. In fact if we continue to double the qubit count every year (like we have done now for 10 years) we’ll get there in 2041 (around 30 years). That might sound like a long way away, but consider the trajectory of Intel, from the 4004 in 1971 to things like the Xeon E5-2697 in 2013. That took 42 years of exponential growth with profit at every point along the way. The conditions are right for the same thing to happen here.
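      Checking that arithmetic: starting from roughly 500 qubits in 2013, reaching $10^{11}$ requires $\log_2(10^{11}/500) \approx 27.6$ doublings, so about 28 years at one doubling per year, which lands at $2013 + 28 = 2041$.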

  2. Why not use tunnel diodes for item 1? They are two state two terminal devices, come with different sets of operating characteristics and can be driven using ferrite cores with different turns ratios (“weights”). They also work at room temperature.

    On pages 343–344 of Ken Cattermole’s book Principles of Pulse Code Modulation he writes of “Several binary decision processes overlapping in time” using a process like this.
    http://www.amazon.co.uk/Principles-pulse-code-modulation-Cattermole/dp/0444197478
    http://www.amazon.com/Principles-pulse-code-modulation-Cattermole/dp/0444197478

    • Hi John! Maybe? As long as you could design a processor satisfying the five things I mentioned you’re good to go! Note that it would be cool to operate at room temperature, but this places constraints on the energy scales of the devices to make this work (step 4 in particular). Brains do it, so it’s possible, but just having a two-state device isn’t enough.

  3. Hi Geordie,
    This is very cool! But I’m still trying to wrap my head around the architecture that you’ve designed here. So each one of these 500-qubit blocks is essentially a Vesuvius chip. Does that mean you’re running a whole bunch of these chips in parallel, or are you training each block separately on the same chip and then storing the weights on a classical computer? And would you say that this architecture could be seen as an RBM with each block acting as a neuron?

  4. ps it would be cool if you could turn on the “popular blogs” and “recent commenters” plugins on your blog, it helps a lot to track/explore the site, thx much

    • Also, I read another commenter’s post on one of your later blog entries, and I think she has a point; 3D packaging and trigate spin-transistors would not be a bad idea.

  5. Hi, I am a newcomer.
    To my understanding, neural networks represent information as a matrix of weight parameters. When people train a network on conventional computers, they use gradient descent or statistical methods. However, quantum computing is different from conventional computing. I just wonder, have you tried evolutionary techniques (e.g. genetic algorithms) to search for the best parameters? Genetic algorithms are not efficient on conventional computers because you need to produce a huge population of candidates, but the situation may be different for quantum computers (e.g. http://www.hindawi.com/journals/mpe/2013/730749/). If you are interested, please email me dr.david.chik at googlemail.com and we can discuss in more detail.

    • Hi David! What quantum computers do is provide samples from a probability distribution. This distribution is not so far away from a Boltzmann distribution over energies $E(x_1, \ldots, x_n) = \sum_{k=1}^{n} h_k x_k + \sum_{k<m} J_{km} x_k x_m$ (i.e. quadratic in x), where the $x_k$ are binary.

      This type of sampling is at the heart of many different types of ML algorithms and so it would be cool to use it instead of the things people usually use (like contrastive divergence and its relatives).

      You can think of the samples you get from these distributions in a ‘genetic algorithm’ context as potential candidate solutions to something. But fundamentally what the system does is sample from a Boltzmann like distribution over binary variables with a quadratic energy function.
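      As a concrete toy of that energy function (values chosen arbitrarily for illustration):

      ```python
      import numpy as np

      # E(x) = sum_k h_k x_k + sum_{k<m} J_{km} x_k x_m, for binary x
      def energy(x, h, J):
          return float(h @ x + x @ np.triu(J, 1) @ x)

      h = np.array([0.5, -1.0, 0.2])
      J = np.array([[0.0, 1.0, -0.5],
                    [0.0, 0.0, 0.3],
                    [0.0, 0.0, 0.0]])
      x = np.array([1.0, 0.0, 1.0])
      print(energy(x, h, J))   # 0.5 + 0.2 - 0.5 = 0.2
      ```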

  6. Dear Geordie, So… one qubit can represent one neuron, rather than one connection? In conventional neural networks, N neurons generate some N^2 connections and you need some bits to store the value of each connection, but in quantum computing you only need to represent the neurons but not their connections? If this is true, then it will be a big saving in memory! Look forward to your results on MNIST, CIFAR-10, CIFAR-100, etc. I do hope that D-Wave can beat deep learning (you can find the best performance so far in http://rodrigob.github.io/are_we_there_yet/build/ ) Please keep in touch.

    • One qubit is one neuron, and the physical connections between qubits are the allowed connections between neurons. You don’t get native all-to-all connections in any physical neural net (including our brains) because of the N^2 issue: there isn’t a way to lay out a repeating structure on a chip that gives you all-to-all connectivity. The solution is to push the per-neuron connectivity up as far as the underlying technology will allow, but it will never be all-to-all in a scalable design.

    • This is the right sort of idea. You can think of the QC HW as a physical annealer that uses QM to help the process. The trick is to find ways to use the capability effectively in practice. This generally means finding ways to structure the learning network so that the connectivity structure of the HW matches the structure of something in the data we’re trying to learn.

  7. Dear Geordie,

    Have you tried all-to-all coupling for all 1000+ qubits? Splitting into several hidden layers (deep learning) is based on classical thought. Perhaps for quantum computing we need to think differently – it starts with all-to-all coupling and then settles as a reservoir with a particular structure depending on the problem.

    • The main issue you need to deal with in this regard is the connectivity structure of the actual QC hardware. That’s the main constraint in designing connectivity for learning architectures. You can (and probably should) develop hybrids where the computation happening in the QC HW is a modular building block which is wrapped in other conventional computation. For example you can use the HW as layers in a deep architecture where you all-to-all connect (in SW) these layers and use conventional techniques to train the full network. Not all the network needs to be in HW, just parts of it.
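      A sketch of that modular idea; `hardware_layer` is a hypothetical stand-in for the QC call (faked with coin flips so the snippet runs), with conventional software layers wrapped around it:

      ```python
      import numpy as np

      rng = np.random.default_rng(6)

      def hardware_layer(field):
          """HYPOTHETICAL QC building block: would sample this layer's units
          given the incoming activations. Faked here with coin flips."""
          return rng.choice([0.0, 1.0], size=field.shape)

      x = rng.random((2, 8))                     # input batch
      W_in = 0.1 * rng.standard_normal((8, 8))   # trained conventionally, in software
      hidden = hardware_layer(x @ W_in)          # this layer lives in hardware
      W_out = 0.1 * rng.standard_normal((8, 4))
      y = hidden @ W_out                         # back to conventional computation
      ```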

  8. In addition, may I ask how you present the training examples? Conventional methods (e.g. stochastic gradient descent) show one training example, change some connections of the network a little bit, and then show another training example, so the network slowly migrates to a higher accuracy as more and more examples are shown. In D-Wave, it seems that when we show one training example, we get one sample from the system. When we show another example, we get another sample, which has no relation to the previous one? Please correct me if I am wrong.

    • Hi David! The strategy is exactly the same as in conventional DBM training, you create batches over randomly drawn subsamples of the data to use as training examples. The batch size is a hyperparameter.
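      For example (a trivial sketch of the batching; nothing here is D-Wave-specific):

      ```python
      import numpy as np

      rng = np.random.default_rng(5)

      def minibatches(data, batch_size):
          """Randomly drawn subsamples of the data, one batch at a time."""
          idx = rng.permutation(len(data))
          for start in range(0, len(data), batch_size):
              yield data[idx[start:start + batch_size]]

      data = np.arange(20).reshape(10, 2)
      for batch in minibatches(data, batch_size=4):
          print(batch.shape)   # the weights get a small update after each batch
      ```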

    • Hi David! Yes there are many ways to connect the underlying resources available to QCs to training Boltzmann Machines. I think the paper you’ve referenced doesn’t address a central point (actually how you prepare the states they reference). They make use of gate model QC ideas and language (such as amplitude amplification) which doesn’t work in practice. It may be possible to convert their ideas to real hardware but that would take a considerable amount of work and the ideas might break if you move away from theory into a real QC.

  9. Hi Geordie,
    May I ask, how many connections per qubit have been achieved at this moment? In our brain, on average each neuron has 1500 connections. Some neurons may have 10000 connections. The dendrites are relatively thin; therefore many long-distance connections can be established. Not sure if similar things can be achieved in D-Wave. If there exists a ceiling on the connectivity, then it will be a big problem for the learning capability.
