# Sampling from a probability distribution

Roundabouts have also been scientifically proven to be too complex for the human mind to understand.

I find lots of things confusing. For example, I don’t understand Two and a Half Men. I tried watching it a few times. I don’t get it.

Another thing I find confusing is sampling from a probability distribution. I have always had trouble with probabilities. That whole Monty Hall thing really did me in for a while. But this is a really important concept, both for quantum computers and for the future cognitive power of the critters we’re trying to build. Because it’s confusing I’d like to talk about it a bit.

Let’s start by thinking about flipping a coin. Each time we flip the coin we either get heads or tails. (If something else happens, like it lands on its edge or something, we’ll just try again). We can think of the coin flip as being for all practical purposes random, and the probability of either outcome being 50% — half the time it’s heads, the other half tails.

Let’s try to build a mathematical model of this, and let’s try to do this using a Boltzmann distribution. Let’s call the value of the coin we see when we flip it x, and let’s say x=0 corresponds to heads, and x=1 corresponds to tails. Let’s call the ‘energy’ of each outcome $E(x)$. For reasons that hopefully will become clear, let’s make the energies of both scenarios the same — $E(0) = E(1)$.

Now let’s write down the mathematical equation for the Boltzmann distribution. It is

$P(x) = {{1}\over{\cal Z}} \exp(-E(x)/T)$

where ${\cal Z} = \exp(-E(0)/T) + \exp(-E(1)/T)$ is the partition function and $T$ is the temperature of the distribution.

If we write this out explicitly, we get

$P(x) = {{\exp(-E(x)/T)}\over{\exp(-E(0)/T) + \exp(-E(1)/T)}}$

A few things you can check at this point. One is that $P(0) + P(1) = 1$, which means the probability of seeing zero (heads) plus the probability of seeing one (tails) is equal to one. The other is that in the case where $E(0) = E(1)$ the probabilities of the two solutions are equal — $P(0) = P(1) = 0.5$ regardless of the actual values of $E(0)$ and $E(1)$.
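Both checks are easy to run numerically. Here is a minimal sketch; the function name `boltzmann` and the particular energies and temperatures are illustrative choices, not anything fixed by the model.

```python
import math

def boltzmann(energies, T=1.0):
    """Return Boltzmann probabilities for a list of energies at temperature T."""
    weights = [math.exp(-E / T) for E in energies]
    Z = sum(weights)  # partition function
    return [w / Z for w in weights]

# Equal energies give a fair coin, whatever the shared energy value is.
fair = boltzmann([3.7, 3.7], T=2.0)

# Unequal energies: the lower-energy outcome (x=0 here) is more probable.
biased = boltzmann([0.0, 1.0], T=1.0)

print(fair, sum(fair))
print(biased, sum(biased))
```

Changing the shared value 3.7 to anything else leaves `fair` at 50/50, which is the "regardless of the actual values" claim in action.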

The value $P(x)$ is the probability distribution over the variable $x$, and is defined by providing the energies of all of the possible values of $x$ (there are only two for a coin flip), and also a temperature $T$.

Drawing a sample from $P(x)$ means creating a random number, and assigning a value to $x$ (either zero or one) depending on the value of the random number. In the case of flipping a fair coin, a sample is the result of a coin toss and is either 0 (heads) or 1 (tails), and the probability of each outcome is 50%.
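In code, drawing a sample can look like the sketch below. The helper name `sample_coin`, the seed, and the number of flips are assumptions made for the illustration.

```python
import math
import random

def sample_coin(E0, E1, T=1.0, rng=random):
    """Draw x in {0, 1} from the two-state Boltzmann distribution."""
    w0 = math.exp(-E0 / T)
    w1 = math.exp(-E1 / T)
    p0 = w0 / (w0 + w1)
    # A uniform random number in [0, 1) decides the outcome.
    return 0 if rng.random() < p0 else 1

rng = random.Random(0)  # fixed seed so runs are repeatable
flips = [sample_coin(1.0, 1.0, T=1.0, rng=rng) for _ in range(10_000)]
heads_fraction = flips.count(0) / len(flips)
print(heads_fraction)  # should land close to 0.5
```

Each call is one coin toss; the fraction of heads only approaches 50% as the number of tosses grows.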

A more complicated situation

Let’s say that we now have two coins, and for some strange reason when we do a coin toss (now using both), we find that they more often have the same value (both heads or both tails) than opposite values (one heads, the other tails). This would be very peculiar to see with actual coins, but we’re just doing a thought experiment so bear with me. Let’s call the values of the coins $x_1$ and $x_2$.

What it means in our model for something to be more likely is that it has a lower energy. So what we need in our model is for the energies of the two outcomes where the variables are the same to be lower than the energies of the two outcomes where they are different. In other words,

$E(x_1=0, x_2=0) = E(x_1=1, x_2=1) < E(x_1=0, x_2=1) = E(x_1=1, x_2=0)$.

If we write down our Boltzmann probability distribution, we get

$P(x_1, x_2) = {{\exp(-E(x_1, x_2)/T)}\over{\exp(-E(0, 0)/T) + \exp(-E(1, 0)/T) + \exp(-E(0, 1)/T) + \exp(-E(1, 1)/T)}}$

Now a sample from this probability distribution is a pair of bits $x_1, x_2$ that occur with probability $P(x_1, x_2)$. If you plug in the relations between these energies in this example, you should be able to convince yourself that $P(0,0) = P(1,1) > P(0,1) = P(1,0)$.
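If you would rather let a computer do the convincing, here is a small sketch. The energy values `E_same=0.0` and `E_diff=1.0` are made up; all that matters is that the first is lower than the second.

```python
import itertools
import math

def pair_distribution(E_same, E_diff, T=1.0):
    """Boltzmann distribution over two coins that prefer to agree."""
    states = list(itertools.product([0, 1], repeat=2))
    energy = {s: (E_same if s[0] == s[1] else E_diff) for s in states}
    weights = {s: math.exp(-energy[s] / T) for s in states}
    Z = sum(weights.values())  # four terms, one per outcome
    return {s: w / Z for s, w in weights.items()}

P = pair_distribution(E_same=0.0, E_diff=1.0, T=1.0)
for state, p in P.items():
    print(state, round(p, 4))
```

Raising `T` pushes all four probabilities towards 0.25, and lowering it makes the two matching outcomes dominate.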

Now let’s connect to Cid’s proto-brain

In the previous post, we wrote down a similar probability distribution, except it was defined over nine ‘coins’ (that is, variables that could be either heads (zero) or tails (one)). Each of these lives on its own node — one on a Visible Unit, and eight on Hidden Units.

Exactly like in the previous simple examples, we start by defining the energies of all of the possible $2^9$ outcomes of tossing these nine coins. There are $2^9$ possibilities because any outcome from nine heads (the coins reading all zeros) to nine tails (the coins reading all ones) can occur. As before, lower energies will mean more probable outcomes.

We defined the energies of all $2^9$ states to be

$E(y, x_0, x_1, ..., x_7) = y a_0 + \sum_{k=0}^7 b_k x_k + \sum_{k=0}^3 y U_k x_k + \sum_{k,p \in {\cal E}} x_k W_{k,p} x_p$

where $a_0, b_k, U_k$ and $W_{k,p}$ were as yet to be determined real numbers. Given all of these, we can plug in any combination of our nine variables and get out a real number for $E$. Lower numbers mean that combination of variables is more probable.

Here’s where we are going with this idea. Cid’s External Universe consists of a single input, which can either be zero (he sees Grumpy Cat) or one (he sees Creepy Manbaby). The entirety of this Universe is captured by the probability of seeing each of these (this is the way we set it up). In order for Cid to ‘understand’ his Universe, we need to find settings of the parameters $a_0, b_k, U_k$ and $W_{k,p}$ such that the probability distribution of Cid’s visible unit when a sample is drawn from his internal representation matches the probability distribution Cid sees when he looks out into his External Universe. If we can achieve this, Cid has created an internal representation of his External Universe that is equivalent to actually looking out into the External Universe. He will have reached Enlightenment, and will no longer need to open his eyes.
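To make "the probability distribution of Cid's visible unit" concrete, here is a sketch that evaluates the energy above and sums over all hidden configurations. Two things in it are assumptions: the edge set (16 hidden-to-hidden edges, with each of hidden units 0..3 joined to each of 4..7, which matches the edge count but may not match the pictured wiring) and the parameter values, which are random placeholders rather than anything learned.

```python
import itertools
import math
import random

# ASSUMPTION: hypothetical edge set, hidden units 0..3 each joined to 4..7 (16 edges).
EDGES = [(k, p) for k in range(4) for p in range(4, 8)]

def energy(y, x, a0, b, U, W):
    """E(y, x_0..x_7) from the formula above; W is a dict keyed by (k, p) edges."""
    e = y * a0
    e += sum(b[k] * x[k] for k in range(8))
    e += sum(y * U[k] * x[k] for k in range(4))
    e += sum(x[k] * W[(k, p)] * x[p] for (k, p) in EDGES)
    return e

# ASSUMPTION: random placeholder parameters, not learned values.
rng = random.Random(1)
a0 = rng.uniform(-1, 1)
b = [rng.uniform(-1, 1) for _ in range(8)]
U = [rng.uniform(-1, 1) for _ in range(4)]
W = {edge: rng.uniform(-1, 1) for edge in EDGES}

def visible_marginal(T=1.0):
    """Return (P(y=0), P(y=1)), summing over all 2^8 hidden configurations."""
    weight = [0.0, 0.0]
    for y in (0, 1):
        for x in itertools.product([0, 1], repeat=8):
            weight[y] += math.exp(-energy(y, x, a0, b, U, W) / T)
    Z = weight[0] + weight[1]
    return weight[0] / Z, weight[1] / Z

print(visible_marginal())
```

Learning, in this picture, means nudging `a0`, `b`, `U` and `W` until the first number printed matches how often Grumpy Cat shows up in the External Universe.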

In the next post, we’ll work through how we can make this happen, solely by Cid learning about his Universe.

# Boltzmann Machines for the Grumpy Universe

In the last post, we thought a bit about machine creatures, and in particular Cid, an unfortunate who we are going to torment quite a bit over the next few posts.

Today we’ll do a little bit of construction and deconstruction of Cid. We’re going to build him a brain, and try to see how it works, and whether we can get his brain to do what we want.

To proceed, we’re going to separate out the entirety of Cid’s Universe into three distinct parts. The first is the External Universe. This will consist of everything outside of Cid. The third is Cid’s brain, which will attempt to build a model of the External Universe, and will reside entirely within Cid. The second is the interface layer between the two, which in this context you can think of as an eye. This interface layer can accept information from the External Universe, whereas his brain cannot. The brain accepts information from the interface layer, and can also send information to the interface layer. Take a look at this picture. Hopefully the idea is clear!

Cid’s high level architecture.

Here’s another way of looking at the same thing that highlights the separation between Cid and the External Universe.

I think I like this one better as it emphasizes that Cid is contained and separate from the External Universe.

This segregation is really important and is tied to some real meaty issues. If you think of your own body and how it lives in your Universe, we have the same type of architecture. Your External Universe is roughly everything outside your skin; your interface layer is roughly everything on the outside of your body; and your internal model of the world is roughly everything inside your skin (probably mostly what’s inside your skull).

In the picture there is a small orange circle. We’ll call this a visible unit. (Now we’re starting to connect to a real Boltzmann Machine. Exciting!). You can think of the visible units as vertices in a graph. They are special in our architecture, in that they are able to ‘see’ into the External Universe, and are connected into Cid’s brain. Whenever you read ‘visible units’ in the context of Boltzmann Machines, think interface layer between the External Universe and the creature’s internal representation of it. It’s the layer that separates ‘outside the creature’ from ‘inside the creature’.

Inside Cid’s brain

So far we haven’t talked at all about what might be going on inside Cid’s brain. Let’s fix that, and build an actual brain that allows Cid to understand the Grumpy Universe.

Recall that the Grumpy Universe is a very silly place, where the External Universe consists of only two possible inputs (those being Grumpy Cat and Creepy Manbaby). Now instead of actually using the images themselves, let’s simplify things a bit and represent these by a zero (for Grumpy Cat) and a one (for Creepy Manbaby). So Cid’s interface layer will only ever see a zero (our stand-in for Grumpy Cat) or a one (for Creepy Manbaby).

To build Cid a brain, let’s do the following. Let’s set up a number of nodes, like the visible unit, but hidden. We’ll call these ones Hidden Units. Here’s a picture of what a possible Cid brain could look like.

Here we have one visible unit (the orange circle) and eight hidden units (the yellow circles).

From now on, we’ll just focus on the visible and hidden units to simplify things. Here they are.

A proto-brain for Cid.

Here we’ve added a couple of things. Each of the nodes now has a label. The visible units (of which there is now only one) we’ll label $v_k$, where $k$ is an integer denoting which visible unit we’re referring to. The hidden nodes are labeled $h_k$, where $k$ is again an integer referring to a specific node. We’ve (arbitrarily) chosen eight hidden nodes.

We’ve also added some black lines that connect some, but not all, of the nodes together. The connectivity pattern shown above is just one of many different ones we could pick. This particular one will turn out to be quite useful for some things I want to show you, but we could just as well have allowed all to all connectivity.

Wherever there is a black line, we introduce a real number which we call a weight. In the proto-brain above, there are four of these between the visible unit and the hidden units, and 16 of them between the different hidden units. We’ll write the weights between the visible and hidden units as $U_k$, where $k = 0, 1, 2, 3$ is the index of the hidden unit the weight connects to. We’ll write the weights between hidden units as $W_{k, p}$ where $k$ and $p$ are the indices of the hidden units the weight connects. Here’s a picture to help make this clearer.

Here some of the weights are explicitly shown: all four U weights (connecting the visible unit to the hidden units) and three of the W weights (the bold lines with the W next to them).

Now let’s assume that each of the nodes can take on one of two values — say either zero or one (it could be -1 and +1 also — any two values will do). The total number of nodes in the current architecture is 1 (visible) + 8 (hidden) = 9. Since each of these nodes can have value 0 or 1, all nine of them together can be specified with nine bits. We’ll use the convention that the leftmost bit is the visible unit, and the rightmost eight bits are the hidden units. Let’s call the value of the visible unit $y$, and the values of the hidden units $x_k$, where $k=0..7$ refers to each of the eight hidden units.
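The nine-bit convention is easy to pin down in code. In this sketch the helper name `decode_state` is made up, and so is one detail the text leaves open: it assumes $x_0$ is the leftmost of the eight hidden bits.

```python
def decode_state(index):
    """Split a 9-bit state index into (y, [x_0..x_7]).

    The leftmost bit is the visible unit. ASSUMPTION: x_0 is the
    leftmost of the remaining eight hidden bits.
    """
    bits = format(index, "09b")  # zero-padded 9-character binary string
    y = int(bits[0])
    x = [int(ch) for ch in bits[1:]]
    return y, x

print(decode_state(0))            # every unit reads zero
print(decode_state(0b100000001))  # visible unit on, last hidden unit on
print(2 ** 9)                     # number of distinct states
```

Looping `index` over `range(2 ** 9)` visits every state of the network exactly once.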

We now define the probability of any particular state of our network to be

$P(y, x_0, x_1, ..., x_7) = {{1}\over{\cal Z}} \exp(-E(y, x_0, x_1, ..., x_7) / T)$

Where

$E(y, x_0, x_1, ..., x_7) = y a_0 + \sum_{k=0}^7 b_k x_k + \sum_{k=0}^3 y U_k x_k + \sum_{k,p \in {\cal E}} x_k W_{k,p} x_p$

The probability distribution $P$ is called a Boltzmann distribution (ergo the term ‘Boltzmann Machine’). The variable $T$ is the temperature of the distribution. The quantity

${\cal Z} = \sum_{all-possible-states} \exp(-E(y, x_0, x_1, ..., x_7) / T)$

is called the partition function, and for a network of any real size it’s pretty much impossible to calculate (it will turn out we don’t need to!).

I’ve introduced some parameters here — $a_0$ and $b_k$ are local biases on each of the nodes. They are (as yet unknown) real numbers, just like the $U_k$ and $W_{k,p}$ weights. The notation $\sum_{k,p \in {\cal E}}$ just means only sum over the $k, p$ pairs that have an edge between them.
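For Cid's nine units the partition function has only 512 terms, so brute force still works; the trouble is that the number of terms doubles with every unit added. Here is a sketch of the brute-force sum, using a deliberately trivial energy (every state has $E = 0$) so the answer can be checked by eye. The function name `partition_function` and the toy energy are assumptions made for the illustration.

```python
import itertools
import math

def partition_function(energy_fn, n_units, T=1.0):
    """Sum exp(-E/T) over every one of the 2**n_units binary states."""
    return sum(
        math.exp(-energy_fn(state) / T)
        for state in itertools.product([0, 1], repeat=n_units)
    )

# Toy check: if every state has E = 0, Z is just the number of states.
Z_flat = partition_function(lambda state: 0.0, n_units=9)
print(Z_flat)  # 512.0

# The number of terms in the sum for larger brains.
for n in (9, 50, 500):
    print(n, "units ->", 2 ** n, "terms")
```

Swapping the `lambda` for the real energy function gives the real partition function, for as long as $2^n$ stays small enough to enumerate.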

OK that’s enough for today. Next post we’re going to start exercising that brain!

# Boltzmann Machines & distributions of patterns from the real world

There are many excellent overviews of Boltzmann Machines. Here’s one I particularly enjoyed — you should read it!

In this overview an important concept is raised. I’d like to talk about it a bit, as I think it’s quite important to understand before we jump into describing BMs. It’s related to a bunch of interesting problems in creating intelligent machines also.

Distributions of patterns from the real world

Imagine there is a strange creature that has evolved in a particularly weird environment. Let’s call it The Chortler in Darkness, or Cid for short.

Cid has evolved to open his eyes exactly once a second for 12 hours, and then sleep for exactly 12 hours. For reasons that are probably perfectly reasonable but beyond the ken of our feeble human brains, every time he opens his eyes he sees either the image on the left, or the image on the right. This is Cid’s Universe. For him, there is nothing but this.

Cid is born into his Grumpy world having no knowledge, context or understanding. However, he has eyes (he’s able to see the above images), and an instinctual need to open them to look once a second.

Now let’s say he does this for the very first time, one second after he’s born. He opens his eyes, and he sees Grumpy Cat. He closes them, and then one second later, opens them again, and sees Creepy Manbaby. He keeps doing this, and after 12 hours, he’s seen Grumpy Cat 14,567 times and Creepy Manbaby 28,633 times.

Cid has a rudimentary sort of brain. The way this brain works is that when Cid is sleeping, its job is to generate what Cid sees when Cid is awake. So instead of the outside world feeding information into Cid’s brain, via his eyes, when Cid is asleep his brain generates the exact same type of information and pushes this out from his brain onto his eyes, one image per second. You can think of this as a kind of dreaming.

If Cid’s brain can generate the same distribution of patterns that Cid’s eyes see when he’s awake, Cid’s brain has built an effective model of the Universe in which he lives, and we can say that he understands his Universe. It’s important to understand that for Cid there is no physics, chemistry, biology, language or anything else — just two images appearing randomly with some probability.

When Cid opens his eyes once a second to look out at the Universe, what he is doing is sampling from a probability distribution over all possible things that he can see. In his case, there are only two possible things he can see, and since the probability of seeing something is 1 per sample, there is only one unknown, and that is the probability of seeing one of the two (the other probability is just 1 minus whatever that is). As the number of samples from the underlying probability distribution grows, we get more and more information about the ‘true’ probability. After 12 hours, Cid saw Grumpy Cat 33.7% of the time, so his brain, when it’s generating these patterns, should contain a probabilistic model that spews out a picture of Grumpy Cat about 33.7% of the time, and Creepy Manbaby 66.3% of the time.
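The arithmetic, plus a crude version of "dreaming", fits in a few lines. The counts come from the story above; the seed and the idea of dreaming for exactly as many seconds as Cid was awake are choices made for this sketch.

```python
import random

grumpy, manbaby = 14_567, 28_633
total = grumpy + manbaby          # one look per second for 12 hours
p_grumpy = grumpy / total
print(total, round(p_grumpy, 3))  # 43200 0.337

# "Dreaming": generate images from the estimated probability instead of looking.
rng = random.Random(0)            # fixed seed so runs are repeatable
dream = [0 if rng.random() < p_grumpy else 1 for _ in range(total)]
print(dream.count(0) / total)     # close to p_grumpy, but not identical
```

The dreamed fraction wobbles around the estimate for the same reason the estimate wobbles around the true probability: both are built from a finite number of samples.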

The role of Cid’s brain is to learn a model of his peculiar world. If we draw samples from Cid’s internal representation of the Universe (residing in his brain), we hope to get the same answers as if we were to draw samples from the real world. If we can achieve this, then Cid’s brain’s model of the world — his internal representation of his Universe — is giving the exact same behavior as if he were looking out at the real world, and his internal representation is as powerful as the ‘real world’. He no longer needs to open his eyes — it’s all the same to him to just dream all the time.

Stepping Beyond the Grumpy Universe

The other SNL Universe.

We might feel sorry for Cid, because we’re pretty sure that the ‘real’ Universe is much more complicated than the Grumpy Universe, and Cid is missing out. Because we are Benevolent Gods, we might extract Cid from his comfortable eternal dreaming and plop him down in a Strange New Land. In the SNL Universe, instead of just two possible images, there are many more — say a thousand. But he still works the way he’s always worked — once a second for 12 hours he opens his eyes, and then sleeps for 12 hours. When he’s awake, he gathers information about how many times he sees each of the images. This information is used to create an internal probabilistic model in his brain that attempts to match the probability distribution he sees in the SNL. When he’s asleep, this model generates images, and the closer this model gets to the true distribution of the SNL Universe, the closer he gets to Enlightenment and full comprehension.

Now in this case, it will take longer to get there: not only do we need to see all of the patterns at least once, we need to see them enough times to get a pretty good statistical measure of their likelihoods. But with a thousand possible patterns, it’s likely that Cid will eventually reach the point where dreaming and the SNL real world are indistinguishable. In this Enlightened state, Cid will have transcended the need to open his eyes.
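Here is a rough sketch of why it takes longer. The "true" SNL distribution below is invented (random weights over a thousand images), and total variation distance is just one reasonable way to measure how far Cid's counts are from the truth.

```python
import random
from collections import Counter

N_IMAGES = 1000
LOOKS_PER_DAY = 12 * 60 * 60  # one look per second for 12 hours

rng = random.Random(0)
# ASSUMPTION: a made-up "true" distribution over the thousand images.
raw = [rng.random() for _ in range(N_IMAGES)]
raw_total = sum(raw)
true_p = [r / raw_total for r in raw]

def total_variation(days):
    """Distance between the true distribution and Cid's counts after `days` awake periods."""
    n = days * LOOKS_PER_DAY
    counts = Counter(rng.choices(range(N_IMAGES), weights=true_p, k=n))
    return 0.5 * sum(abs(counts[i] / n - true_p[i]) for i in range(N_IMAGES))

tv_1 = total_variation(1)
tv_10 = total_variation(10)
print("after 1 day:  ", tv_1)
print("after 10 days:", tv_10)
```

The distance shrinks as the days go by, but slowly, because each image now gets only a small share of the looks.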

The Human Experience

Some say this is Cid.

So far Cid has lived a pretty silly existence, and in fact (I forgot to mention this) he actually looks fairly silly also.

Now let’s say that instead of just a thousand images, every time Cid opens his eyes he could see any possible natural image — that is, any image that a human eye could see. He still does the same thing — every second for 12 hours he opens his eyes and looks at one of these, and then for 12 hours generates one every second from his internal model.

He keeps doing this until his internal model matches what he sees in the Real World. If he can do that, then he’s developed an internal representation of images in the Real World, and can generate them in a way that’s indistinguishable from actually opening his eyes and looking at natural images.

Interestingly, there are Cid-like creatures in the world already. Unfortunately, just being able to understand a Universe of natural images isn’t nearly enough to create a creature with human-like cognition. But I think the progress in understanding how creatures can develop internal representations of parts of our Real World is real progress towards that objective.

# Quantum Boltzmann Machines

Sometimes I think about top ten lists. Like what my top ten favorite songs of all time would be, or my top ten favorite books. You can probably tell a lot about someone by what would be on those lists. I once did a personality profiling procedure, which took the answers to about one hundred multiple choice questions and put the respondent into one of 72 bins. [Note: I just found the document, there were 72 bins. It was called the Insights Discovery Profile. I was in Bin #22, AKA "Reforming Director"]. The bin I was in was an eerily accurate description of me. I think of this procedure as ‘human dimensionality reduction’. It’s like PCA!

One of my top ten books is On Intelligence by Jeff Hawkins. Jeff was the founder of Palm and Handspring. If you’re not in a hardware company, there is an important fact I will share with you of which you may be unaware. Building hardware that works in the real world is a special kind of hell. I suspect Dante originally had a level of the Inferno where you had to just make carts or butter churns or whatever, but it was so painful to think about he took it out. So I feel a sort of esprit de corps with anyone who has suffered through that special kind of torture.

I first read On Intelligence in 2008, and it was my first exposure to two important ideas. They are:

1. Intelligence is related to, and can even be defined to be, an entity’s ability to build an internal representation of the world, and correctly predict the outcomes of its possible actions within that model.
2. Mammalian brains contain a structure, called the neocortex, that allows mammals to build models of the world out of the same repeating physical structure, tiled out a huge number of times. This repeating structure is hierarchical, and allows mammals to efficiently build representations of the world that are also hierarchical.

In my next series of posts I want to show you something related to both of these points. It’s a fascinating way to connect what D-Wave hardware does to the bleeding edge of machine learning.

# NASA Experts Available For Interviews About Quantum Computing | NASA

This is cool! Check out the link for a resource that can help answer questions about D-Wave quantum computers from the perspective of our users at NASA.

# Lockheed Martin Tweet Chat: #QuantumChat

Lockheed Martin is hosting an interesting event, which is linked to here. It’s an opportunity to talk to people who are working with actual D-Wave quantum computers. If you have questions, now you can have them answered by people who actually know what they are talking about. What a concept! Exciting! Here’s a brief summary, from the event page linked to above:

Join quantum computing experts from Lockheed Martin, the University of Southern California and D-Wave Systems as they “borrow” their companies’ Twitter accounts to discuss the latest in speedy qubits and the quantum evolution.

Tweet your questions to @LockheedMartin, @USCViterbi or @dwavesys with the #QuantumChat hashtag starting Nov. 7. @LockheedMartin will moderate the chat and pose questions beginning at 1 p.m. EDT on Thursday, Nov. 14. Questions will be selected from those tweeted with the #QuantumChat hashtag between now and the end of the chat.

Can’t follow along with the Tweet Chat live? Watch for the full chat transcript on our Storify page.