Usually when we want to build a classifier, the objects to be classified are first distilled into some compact and meaningful representation. For example, let's say the objects in question are elephants, and we want our classifier to act on these objects to assign +1 to an elephant if it will fit in your car and -1 if it won't. Elephants are highly complex entities. However, much of that complexity is not relevant to this classification task. Its color, for example, probably doesn't matter. Its size, though, should matter quite a bit. In fact, all of the complexity of the elephant might be boiled down to a single number in this case: the ratio of the elephant's volume to your car's volume. The vast list of other features of the elephant probably doesn't matter much, at least for this particular task.

In what follows, we will treat the objects to be classified as vectors of real numbers. Sometimes this can get a bit abstract. But remember that in real life those numbers are (hopefully) some smart “boiled down” version of the actual object.

With this in mind, for us an instance will generally be an M-dimensional vector of real numbers x. By construction, each instance will be one of two types. The first type are positive examples; when our binary classifier looks at one of these, it should label it +1. The second type are negative examples; these should be labeled -1. If the classifier labels a positive example as -1, or a negative example as +1, it has misclassified that example.

Here is an example. Let's say we are given an animal, and we know it's either an elephant or a crocodile. We want to build an automatic classifier so that no human needs to be in the loop. The human programmer might think like this: why don't we put the animal on a scale and classify it by weight? Following this logic, we can create an M=1 (one-dimensional) classifier where x is a single number, in this case the mass of the animal in kilograms.

We then find (say) 10 elephants and 10 crocodiles, where we know which is which, and plot their masses. Call the elephants positive examples (blue data) and crocodiles negative examples (red data). Here is what the data points might look like:

This appears to be a good way to build a classifier, as the elephants (blue data) and crocodiles (red data) have quite different masses in the training set we looked at. We could then build a classifier by finding a mass midway between that of the biggest crocodile and the smallest elephant (say about 3,300 kilograms) and classifying the incoming animal according to whether its mass is greater than or less than this amount.
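The threshold rule above can be sketched in a few lines. This is a minimal illustration, not code from the text: the 3,300 kg cutoff comes from the discussion above, while the sample masses fed to it are made up.

```python
# Sketch of the one-dimensional threshold classifier described above.
# The threshold (midway between the biggest crocodile and the smallest
# elephant) is taken from the text; the test masses are illustrative.

THRESHOLD_KG = 3300.0

def classify_by_mass(mass_kg):
    """Return +1 (elephant) if the mass exceeds the threshold, else -1 (crocodile)."""
    return 1 if mass_kg > THRESHOLD_KG else -1

print(classify_by_mass(5000.0))  # a typical elephant mass  -> 1
print(classify_by_mass(500.0))   # a typical crocodile mass -> -1
```

Any mass at or below the threshold is labeled -1 here; where exactly ties go is an arbitrary choice for a rule this simple.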

We can do the same thing with more dimensions also. Let’s say we wanted to accomplish a more difficult task where having more information about different aspects of the object could help. For example, let’s say we need to label an incoming animal either “Nile Crocodile” or “Saltwater Crocodile”. These are a little closer in mass, so in addition to mass we’ll use length also. Then our training data, using 10 crocodiles of each type, might look like this:

Here the black line divides the two-dimensional region into "Nile crocodile" (red section) and "Saltwater crocodile" (blue section) parts based on inputs consisting of mass and length for animals in the training set. Note that this choice of classifier misclassifies two of the Saltwater crocodiles (the two blue points in the red section), so we would expect this particular choice to make some mistakes.

These two examples use “natural data”: data points that arise from some real classification task. I want to now focus on synthetic data. In particular we will focus on data generated using the following prescription:

- Choose the number of dimensions M that you wish the data to live in.
- Choose the total number of training data points S (take S to be even, as we will generate S/2 positive and S/2 negative examples).
- The data we generate will form Gaussian-distributed clumps. The positive examples (which we will always denote by blue dots) are characterized by a mean value -mv/2 and standard deviation sdb in each of their M dimensions. The negative examples (red dots) are characterized by a mean value +mv/2 and standard deviation sdr in each of the M dimensions. Here is some Matlab code for generating this type of synthetic data:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Code snippet for generating synthetic data.
%
M=2;                                    % number of dimensions
S=100;                                  % total number of data points (even)
sdb=1;                                  % standard deviation of the blue (positive) clump
sdr=2;                                  % standard deviation of the red (negative) clump
mv=10;                                  % separation between the two clump means
bluepositives=-mv/2+sdb*randn(S/2,M);   % S/2 positive examples
rednegatives=+mv/2+sdr*randn(S/2,M);    % S/2 negative examples
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The data points are then stored in the arrays bluepositives and rednegatives. Here is sample output of this code:
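For readers without Matlab, here is a Python/NumPy translation of the same prescription (this is my translation, not code from the original): two Gaussian clumps with means -mv/2 and +mv/2 and standard deviations sdb and sdr in each of the M dimensions.

```python
# Python/NumPy equivalent of the Matlab snippet above: generate two
# Gaussian-distributed clumps of synthetic training data.
import numpy as np

M, S = 2, 100             # number of dimensions; total number of points (even)
sdb, sdr, mv = 1.0, 2.0, 10.0  # blue std dev, red std dev, mean separation

rng = np.random.default_rng(0)  # seeded for reproducibility
bluepositives = -mv / 2 + sdb * rng.standard_normal((S // 2, M))  # positives
rednegatives = +mv / 2 + sdr * rng.standard_normal((S // 2, M))   # negatives

print(bluepositives.shape, rednegatives.shape)  # (50, 2) (50, 2)
```

As in the Matlab version, each array holds S/2 examples, one M-dimensional point per row.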

Because we picked the separation between the two clumps to be larger than their standard deviations, the two clumps separate out nicely. However, if we run the same code with mv=0 we get:

One way to think about the difference between these two pictures is that in the first we have been smart (or lucky) enough to pick the axes so that they are already great classifiers by themselves. In the second example, the quantities plotted on the axes are not by themselves good classifiers.

25 years ago we used algorithms like ID3 to build classifier decision trees from large tabular data sets. We had to cope with nominal, ordinal, and (noisy) real data. Nevertheless, someone (human?) had decided in advance which columns to record for the data sets.

What we really wanted was some way of suggesting which columns to record in the data sets (=design of new experiments, theory-generation). CYC seemed at the time to be the only feasible way to go here, which is the route Doug took. Can you point me to papers about current progress in this area?