I find lots of things confusing. For example, I don’t understand Two and a Half Men. I tried watching it a few times. I don’t get it.
Another thing I find confusing is sampling from a probability distribution. I have always had trouble with probabilities. That whole Monty Hall thing really did me in for a while. But this is a really important concept, both for quantum computers and for the future cognitive power of the critters we’re trying to build. Because it’s confusing I’d like to talk about it a bit.
Let’s start with something easy
Let’s start by thinking about flipping a coin. Each time we flip the coin we either get heads or tails. (If something else happens, like it lands on its edge or something, we’ll just try again). We can think of the coin flip as being for all practical purposes random, and the probability of either outcome being 50% — half the time it’s heads, the other half tails.
Let’s try to build a mathematical model of this, and let’s try to do this using a Boltzmann distribution. Let’s call the value of the coin we see when we flip it x, and let’s say x=0 corresponds to heads, and x=1 corresponds to tails. Let’s call the ‘energy’ of each outcome E(x). For reasons that hopefully will become clear, let’s make the energies of both scenarios the same: E(0) = E(1).
Now let’s write down the mathematical equation for the Boltzmann distribution. It is

P(x) = e^(-E(x)/T) / Z

where Z = e^(-E(0)/T) + e^(-E(1)/T) is the partition function and T is the temperature of the distribution.

If we write this out explicitly, we get

P(0) = e^(-E(0)/T) / [e^(-E(0)/T) + e^(-E(1)/T)]

P(1) = e^(-E(1)/T) / [e^(-E(0)/T) + e^(-E(1)/T)]

A few things you can check at this point. One is that P(0) + P(1) = 1, which means the probability of seeing zero (heads) plus the probability of seeing one (tails) is equal to one. The other is that in the case where E(0) = E(1), the probabilities of the two outcomes are equal, P(0) = P(1) = 1/2, regardless of the actual values of the energies and the temperature T.
The value P(x) is the probability distribution over the variable x, and is defined by providing the energies E(x) of all of the possible values of x (there are only two for a coin flip), and also a temperature T.
Drawing a sample from P(x) means generating a random number, and assigning a value to x (either zero or one) depending on the value of the random number. In the case of flipping a fair coin, a sample is the result of a coin toss and is either 0 (heads) or 1 (tails), and the probability of each outcome is 50%.
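To make this concrete, here is a minimal sketch in Python (the function names are my own invention, not from any library) of computing Boltzmann probabilities for a set of outcomes and drawing a sample from them:

```python
import math
import random

def boltzmann_probs(energies, T):
    """Compute P(x) = exp(-E(x)/T) / Z for each outcome x."""
    weights = [math.exp(-e / T) for e in energies]
    Z = sum(weights)  # the partition function
    return [w / Z for w in weights]

def sample(probs):
    """Draw one sample: returns outcome index x with probability probs[x]."""
    r = random.random()
    cumulative = 0.0
    for x, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return x
    return len(probs) - 1

# A fair coin: both outcomes have the same energy, so P(0) = P(1) = 0.5
probs = boltzmann_probs([1.0, 1.0], T=1.0)
print(probs)  # [0.5, 0.5]
```

Note that with equal energies the actual energy value and the temperature cancel out, which is exactly the "regardless of the actual values" observation above.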
A more complicated situation
Let’s say that we now have two coins, and for some strange reason when we do a coin toss (now using both), we find that they more often have the same value (both heads or both tails) than opposite values (one heads, the other tails). This would be very peculiar to see with actual coins, but we’re just doing a thought experiment so bear with me. Let’s call the values of the coins x_1 and x_2.
What it means in our model for something to be more likely is that it has a lower energy. So what we need in our model is that the energies of the two outcomes where the variables are the same need to be lower than the energies when they are different. In other words,

E(0,0) = E(1,1) < E(0,1) = E(1,0)

where for simplicity we give the two ‘same’ outcomes one energy and the two ‘different’ outcomes another, higher energy.
If we write down our Boltzmann probability distribution, we get

P(x_1, x_2) = e^(-E(x_1, x_2)/T) / Z

where the partition function Z now sums over all four possible outcomes:

Z = e^(-E(0,0)/T) + e^(-E(0,1)/T) + e^(-E(1,0)/T) + e^(-E(1,1)/T)
Now a sample from this probability distribution is a pair of bits (x_1, x_2) that occurs with probability P(x_1, x_2). If you plug in the relations between these energies in this example, you should be able to convince yourself that P(0,0) = P(1,1) > P(0,1) = P(1,0).
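As a sanity check, we can enumerate all four outcomes and compute their Boltzmann probabilities. The energy values below are hypothetical, chosen only so that matching coins get the lower energy:

```python
import math
from itertools import product

# Hypothetical energies: matching pairs (0,0) and (1,1) get the lower energy
def energy(x1, x2):
    return 0.0 if x1 == x2 else 1.0

T = 1.0
states = list(product([0, 1], repeat=2))  # (0,0), (0,1), (1,0), (1,1)
weights = {s: math.exp(-energy(*s) / T) for s in states}
Z = sum(weights.values())                 # partition function over all four states
probs = {s: w / Z for s, w in weights.items()}

for s in states:
    print(s, round(probs[s], 3))
# (0,0) and (1,1) come out more probable than (0,1) and (1,0)
```

Samples drawn from these probabilities would show the correlated-coins behavior: matching pairs appear more often than mismatched ones.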
Now let’s connect to Cid’s proto-brain
In the previous post, we wrote down a similar probability distribution, except it was defined over nine ‘coins’ (that is, variables that could be either heads (zero) or tails (one)). Each of these lives on its own node — one on a Visible Unit, and eight on Hidden Units.
Exactly like in the previous simple examples, we start by defining the energies of all of the possible outcomes of tossing these nine coins. There are 2^9 = 512 possibilities because any outcome from nine heads (the coins reading all zeros) to nine tails (the coins reading all ones) can occur. As before, lower energies will mean more probable outcomes.
We defined the energies of all 512 states to be

E(x_1, …, x_9) = Σ_j h_j x_j + Σ_{j<k} J_{jk} x_j x_k

where the h_j and J_{jk} (one for each pair of connected nodes) were as yet to be determined real numbers. Given all of these, we can plug in any combination of our nine variables and get out a real number for E. Lower numbers mean that combination of variables is more probable.
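To see what this looks like in practice, here is a sketch that enumerates all 512 states and computes the full distribution. The h_j and J_{jk} values are random stand-ins (the real ones are exactly what learning must find), and treating bit 0 as the visible unit is my labeling choice:

```python
import math
import random
from itertools import product

N = 9  # one visible unit plus eight hidden units

# Stand-in parameters; in Cid's case these are the numbers to be learned
random.seed(0)
h = [random.uniform(-1, 1) for _ in range(N)]
J = {(j, k): random.uniform(-1, 1) for j in range(N) for k in range(j + 1, N)}

def energy(x):
    """E(x) = sum_j h_j x_j + sum_{j<k} J_jk x_j x_k for a 9-bit state x."""
    return (sum(h[j] * x[j] for j in range(N))
            + sum(J[j, k] * x[j] * x[k] for (j, k) in J))

T = 1.0
states = list(product([0, 1], repeat=N))   # all 2^9 = 512 outcomes
weights = [math.exp(-energy(x) / T) for x in states]
Z = sum(weights)
probs = [w / Z for w in weights]

# Marginal probability that the visible unit (bit 0) reads 1
p_visible_1 = sum(p for x, p in zip(states, probs) if x[0] == 1)
print(len(states), p_visible_1)
```

The marginal over the visible unit is the quantity we want to match to the External Universe; brute-force enumeration is fine for 512 states, though real learning procedures avoid it.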
Here’s where we are going with this idea. Cid’s External Universe consists of a single input, which can either be zero (he sees Grumpy Cat) or one (he sees Creepy Manbaby). The entirety of this Universe is captured by the probability of seeing each of these (this is the way we set it up). In order for Cid to ‘understand’ his Universe, we need to find settings of the parameters and such that the probability distribution of Cid’s visible unit when a sample is drawn from his internal representation matches the probability distribution Cid sees when he looks out into his External Universe. If we can achieve this, Cid has created an internal representation of his External Universe that is equivalent to actually looking out into the External Universe. He will have reached Enlightenment, and will no longer need to open his eyes.
In the next post, we’ll work through how we can make this happen, solely by Cid learning about his Universe.