I have a childhood memory. I remember picking dandelions, and singing “Mommy had a baby and her head popped off”, while thumb-flick-decapitating the dandelion.
Yesterday this little slice of life came up in conversation with Suzanne. She had similar memories but thought it went “Miss Polly had a dolly and her head popped off.”
For the first time in about 35 years it occurred to me that what I remembered makes no sense. Why would mommy’s (or for that matter the baby’s, if you parse the sentence differently) head pop off? However, a dolly’s head popping off is entirely sensible, and it rhymes!
Then I started to wonder if the elaborate network of memories I have around this important childhood memory might all just be fabricated. So I looked for the Miss Polly version, and lo and behold, in the UK, it’s exactly like she remembered. However where I grew up, it’s the Mommy version that dominates. I tried to find the origins of this and failed. It seems no-one knows where it came from. I’m speculating, but I suspect that the Miss Polly nursery rhyme is Very Old, probably originating in the UK sometime in the Middle Ages, and the version I remember is a mutation arising in North America. How it mutated into Mommy I could not determine.
Isn’t it interesting to think that there are some True Facts About the World (such as the existence of both versions of this dandelion-unfriendly activity) that neither I nor Suzanne knew about, even though we knew enough to know one of them? In the earlier posts about Cid, I discussed the concept that we wanted his internal model of the universe (the one in his brain) to match as closely as possible his external universe. In this case, neither my nor Suzanne’s internal models matched the reality of the external universe accurately. In doing this research I feel that I’ve augmented my internal model a little bit. Now you can ask me anything about Miss Polly and her dolly and quickly regret doing so.
How intelligent is a thing?
When you try to build a thing, you need measures to determine how well you’re doing. This is really important. Often you need to choose between multiple paths forward, and being able to assign a set of numbers to ‘how good’ each design is allows you to make reasonable decisions about which paths are better. If we want to build intelligent machines, we need to reduce what we mean by ‘intelligence’ to a set of numbers. This means having a formal mathematical definition of what we mean when we say that word.
People who study intelligence have come up with large numbers of definitions of what the word means. Here’s a review paper from 2007 that contains about 70 of these. If you ask ‘by how much has my intelligence increased, now that I know a little bit more about Miss Polly?’, exactly zero of these answer this question. None of them are capable of producing numbers.
With biological entities living in the ‘real world’, it’s sensible that it would be very difficult to precisely define what we mean by intelligence. It’s just all so complicated! But we might be able to do this for Cid, and the reason is that we have the power to vastly simplify his Universe. And anyway, it’s a necessary condition of trying to build intelligent machines that we need to have a mathematical definition of intelligence. So let’s take a cut at this and see if we can come up with something sensible.
How Intelligent is Cid?
Let’s say we build two versions of Cid, both of which are exposed to exact copies of the same External Universe (we’ll use EU for short — this is the full and complete extent of the Universe that they can measure using their sensors) and both are doing something. We watch them doing whatever they are doing. Can we then measure which is more intelligent? How could we do this?
We could in principle do whatever we wanted to build Cid’s brain. However for now we’re going to restrict the type of brain we’re going to build to be a Boltzmann Machine of the sort we’ve been discussing in the previous posts. Boltzmann Machines are a type of generative model, which work by trying to match the probability distribution over states of the EU to the probability distribution over these states generated by the entity’s brain.
Here’s how we are going to quantify the idea of two probability distributions being ‘similar’. We’re going to use something called the Kullback–Leibler (or KL-) divergence. It is a measure of the information lost when one probability distribution (say the one inside Cid’s brain) is used to approximate another (say the real probability distribution of the EU). The KL-divergence can be used to define a quantitative intelligence measure for Cid.
Let’s define the probability distribution coming from Cid’s brain to be $Q(v)$, and the probability distribution from the EU to be $P(v)$. Then the KL-divergence is

$$KL(P \| Q) = \sum_{v} P(v) \log \frac{P(v)}{Q(v)}$$

where the sum runs over $v$, the possible states of Cid’s Visible Units. We can formally define Cid’s intelligence to be the inverse of the KL-divergence, so that as his model gets better his intelligence increases, going to infinity as he nears a perfect understanding of his EU.
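As a concrete sketch of these two definitions in Python (the function names are mine, and I’m assuming natural logarithms — any base works, up to a constant factor):

```python
import math

def kl_divergence(p, q):
    """Information lost when distribution q (Cid's brain) is used
    to approximate distribution p (the EU). Units are nats."""
    return sum(p_v * math.log(p_v / q_v)
               for p_v, q_v in zip(p, q) if p_v > 0)

def intelligence(p, q):
    """Cid's intelligence: the inverse of the KL-divergence.
    Diverges to infinity as the internal model approaches the EU."""
    return 1.0 / kl_divergence(p, q)
```

A perfect model gives a divergence of zero (infinite intelligence); the worse the model, the larger the divergence and the lower the intelligence.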
It’s important that this definition of intelligence is explicitly defined relative to the entity’s EU, and in fact only means something when you keep that in mind. It’s a measure of how well the entity has been able to build an internal representation of what he’s capable of observing. Two entities can only be directly compared to each other using this metric if they have identical EUs. [As an aside, you can also use this to compare two different internal representations — how ‘similar’ two Cid brains are to each other, which is very interesting in its own right].
Generally when we think about intelligence, we have a prejudice that it’s something absolute, and clearly humans have more of whatever it is. The definition above challenges this position somewhat, in that we really need to take into account that different creatures can have dramatically different Universes in which they are submerged and dramatically different sensors that give them information about it. The EU of a 24-eyed jellyfish is very different from that of a two-eyed land-dwelling omnivorous hairless ape. Our prejudice is that our EU and our models of it constitute some kind of superior thing to the jellyfish’s — presumably the jellyfish’s are just a tiny subset of ours. Maybe this is true. But maybe not.
We’re going to refer to the variety of numerical quantifications of intelligence we’ll come up with as S-numbers. This is because the idea of coming up with a series of numbers to quantify the intelligence of the machines we’re building comes from Suzanne. This particular one we’ll call $S_1$.
KL-divergence and $S_1$ in the Grumpy Universe
The first EU we will show to Cid will be the Grumpy Universe. Recall that this Universe can be thought of as comprising a single bit $v$, where we as omnipotent gods get to arbitrarily set the probability of seeing the states of that bit. Let’s say that we choose the probability of the bit being zero (Cid opens his eyes and sees Grumpy Cat) to be $p_0 = 0.135$. This of course fixes the only other possibility (the bit is one — Cid opens his eyes and sees Creepy Manbaby) to be $1 - p_0 = 0.865$.
Once we have fixed these, we can write out the KL-divergence explicitly as

$$KL(P \| Q) = p_0 \log \frac{p_0}{q} + (1 - p_0) \log \frac{1 - p_0}{1 - q}$$

where $q$ is the probability of Cid’s internal model generating a zero, and the $S_1$ number is the reciprocal of this:

$$S_1 = \frac{1}{KL(P \| Q)}$$
The $S_1$ number diverges when the entity’s model of the EU is perfect. We’re going to call this state Enlightenment. The state of Enlightenment is always defined relative to a specific EU. Our objective will be to allow Cid to become Enlightened, in a series of increasingly complex EUs.
The entirety of Cid’s intelligence comes down to a single number — the probability $q$ of his internal model generating a zero when Cid is dreaming. Let’s see what the KL-divergence function looks like.
You can see that it goes to zero at the ‘correct’ value of $q = 0.135$, and is convex.
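The shape of that curve can be checked numerically — a quick sketch in Python, again assuming natural logarithms:

```python
import math

P0 = 0.135  # true probability of Grumpy Cat (the bit being zero)

def grumpy_kl(q):
    """KL-divergence for the single-bit Grumpy Universe, where q is
    the probability of Cid's internal model generating a zero."""
    return P0 * math.log(P0 / q) + (1 - P0) * math.log((1 - P0) / (1 - q))

# Evaluate along a grid of model probabilities:
for q in [0.05, 0.135, 0.5, 0.9]:
    print(f"q = {q:5.3f}  KL = {grumpy_kl(q):.4f}")
```

The divergence is exactly zero at $q = 0.135$ and grows on either side of it, matching the convex curve described above.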
Cid’s Boltzmann Machine Brain
Recall that we chose a specific architecture for Cid’s brain, which consisted of eight Hidden Units and one Visible Unit. Here it is.
Cid starts by not having any way to know what any of the free parameters in this model should be. If we were to just randomly set all of them, allow his brain to reach thermal equilibrium at a fairly low temperature, and then draw samples from the resultant probability distribution, the probability of the Visible Unit being zero would just be some random number between zero and one — he’s completely disconnected from his EU. So let’s say we were to do this and then measured this probability to be, say, 0.645. Looking at the chart for the KL-divergence, this gives about 0.3, the inverse of which is about 3. So the $S_1$ number — the intelligence — of this random creature would be about 3.
Of course we don’t want to build creatures that don’t interact with their environments. We want them to learn from them. We want them to become Enlightened. And thankfully the Boltzmann Machine comes with a prescription for adjusting all of its parameters to decrease the KL-divergence (and thereby increase Cid’s intelligence). By following this prescription, Cid can become smarter by looking around at his world and increasingly understanding it.
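That prescription is gradient descent on the KL-divergence: in the standard Boltzmann Machine learning rule, each weight moves in proportion to the difference between correlations measured with the Visible Units clamped to the EU’s data (‘awake’) and correlations measured when the machine runs freely (‘dreaming’). Here’s a minimal sketch of one update step in Python, assuming the two correlation estimates have already been collected by sampling (the function name is mine):

```python
def boltzmann_update(weights, clamped_corr, free_corr, learning_rate=0.1):
    """One step of the standard Boltzmann Machine learning rule:
    dw_ij = lr * (<s_i s_j>_clamped - <s_i s_j>_free).
    Following this gradient decreases the KL-divergence between the
    EU's distribution and the model's distribution over Visible Units."""
    return [[w + learning_rate * (c - f)
             for w, c, f in zip(w_row, c_row, f_row)]
            for w_row, c_row, f_row in zip(weights, clamped_corr, free_corr)]
```

When the two sets of correlations agree, the update is zero — the machine’s dreams are statistically indistinguishable from what it sees, which is exactly the state of Enlightenment described above.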
In the next post, we’ll actually do the training! If we can succeed, Cid’s $S_1$ number will diverge and he’ll have complete and utter understanding of the Grumpy Universe.