III. A dictionary of weak classifiers

In the previous post I talked about how we will think about data. Now I want to discuss the next item on the list, which is how we structure and construct the desired classifier.

The first key concept is that of a weak classifier. A weak classifier is a function h that acts on an instance \vec{x} and returns a label \pm 1, where the performance of the classifier can be quite poor. You can think of a weak classifier like a person who doesn’t know very much about the classification task and is pretty much guessing.

Here is an artist’s depiction of a weak classifier:


The desired classifier is a weighted sum over a dictionary of weak classifiers. The classification function is y(\vec{x})=sign(\sum_j w_j h_j(\vec{x})). y is called a strong classifier. The role of the learning phase of the algorithm will be to select the weights w_j to produce the final strong classifier.

It turns out that even if the weak classifiers h_j are all very poor, this type of linear combination can lead to an excellent strong classifier… very much a “wisdom of the crowds” type thing. Given a large number of weak classifiers, the training phase determines how much you should listen to each, and combines all that advice to “vote” on whether a new object should be classified +1 or -1. Here is a picture featuring my girlfriend Hayden as an example. The learning algorithm decides how we should weight each of the weak classifiers’ opinions.


This classification function and its underlying philosophy (combining a large number of weak classifiers into a good strong classifier) is common to all boosting approaches (and democracy, although in democracy you unfortunately don’t get to weight people’s votes). One difference in our approach is that we will restrict the weights w_j to be binary, either 0 or 1, whereas in most boosting approaches they are real numbers. At first glance this might seem to restrict the power of the strong classifier, but it turns out that it doesn’t (much). A beneficial side effect of this restriction is that the size of the strong classifier can be significantly reduced: instead of storing a bunch of reals, all you need is the same number of bits.
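To make the binary-weight idea concrete, here is a minimal Python sketch. The three weak classifiers and their thresholds are invented purely for illustration, and I break the sign(0) tie toward +1, which is one arbitrary convention:

```python
# Hypothetical weak classifiers: each maps an instance x to a vote of +1 or -1.
# (These particular thresholds are made up for illustration.)
weak_classifiers = [
    lambda x: 1 if x[0] > 0.5 else -1,
    lambda x: 1 if x[1] > 0.2 else -1,
    lambda x: 1 if x[0] + x[1] > 1.0 else -1,
]

def strong_classify(x, w):
    """y(x) = sign(sum_j w_j * h_j(x)), with binary weights w_j in {0, 1}.
    Ties (total == 0) are arbitrarily broken toward +1 here."""
    total = sum(wj * h(x) for wj, h in zip(w, weak_classifiers))
    return 1 if total >= 0 else -1

# With binary weights, the learned model is just one bit per dictionary entry:
w = [1, 0, 1]   # e.g. learning turned weak classifiers 0 and 2 on, 1 off
print(strong_classify((0.8, 0.9), w))   # both active classifiers vote +1, so +1
```

The point of the sketch is the storage argument: the list `w` could be packed into 3 bits, whereas real-valued weights would need a float apiece.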

Before I provide an actual functional form for the weak classifiers I want to use, I want to give a more conceptual picture of what is going on with the weak vs. strong classifier story. Imagine you have a room full of people, say 32 of them (these are our weak classifiers). Assume none of these people read Farsi or Arabic. Now show them 100 pages of Arabic text and 100 pages of Farsi text and ask them to tell you, one by one, whether each page is Farsi or Arabic. Each one of them will likely get it right about 50% of the time. Now let’s change it up so that each of the 32 gets a vote, but weighted by how well they did on the first 200 test cases (their vote counts for more if they did better), and give them another 200 cases to classify. The combined performance of the 32 voting together will be much better than that of any individual. This combined performance is the strong classifier. Wisdom of the crowds.
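The room-full-of-people story can be made quantitative. Under the simplifying assumption that the voters are independent and equally accurate (so all votes carry equal weight, a special case of the weighted vote above), the chance that a straight majority vote is correct is a binomial tail sum. A short Python check:

```python
from math import comb

def majority_vote_accuracy(n, p):
    """Probability that a majority of n independent voters, each correct
    with probability p, gets the right answer (even-n ties split 50/50)."""
    prob = sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))
    if n % 2 == 0:  # resolve an exact tie by a coin flip
        prob += 0.5 * comb(n, n // 2) * p**(n // 2) * (1 - p)**(n // 2)
    return prob

# 32 voters, each only barely better than guessing, already beat any one voter:
print(majority_vote_accuracy(32, 0.55))
```

With more voters, or voters a little further from 50%, the ensemble accuracy climbs rapidly toward 1; this is the "wisdom of the crowds" effect the learning phase exploits.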

Back to the generic case. In the form for the strong classifier I gave above, after the learning phase (where we show the classifier a set of positive and negative examples), we get N binary numbers w_j which (together with the fixed dictionary of weak classifiers) specify the desired strong classifier. The entire learning exercise is to find which set of weak classifiers to turn on (w_j=1) and which to turn off (w_j=0). Once you’ve done this, you can load the classifier on your cell phone and use it; no more heavy-duty computation required.

The fixed dictionary of weak classifiers I am going to use in this application is very simple. Here is how it is defined.

For each dimension d=1..M there are two weak classifiers, denoted h_d^{+}=sign(x_d-\Theta_d^{+}) and h_d^{-}=sign(-x_d-\Theta_d^{-}), where \Theta_d^{+} and \Theta_d^{-} are real numbers that we must calculate subject to a prescription I’ll give shortly, and x_d is the d^{th} component of \vec{x}. Therefore our fixed first-order dictionary has 2M weak classifiers.
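Here is a tiny Python sketch of one such ± pair, following the definitions above (since the text doesn’t specify what sign(0) should do, I arbitrarily break that tie toward -1):

```python
def h_plus(x_d, theta):
    """h_d^+ = sign(x_d - theta): votes +1 when x_d exceeds the threshold."""
    return 1 if x_d - theta > 0 else -1

def h_minus(x_d, theta):
    """h_d^- = sign(-x_d - theta), as defined above."""
    return 1 if -x_d - theta > 0 else -1

print(h_plus(2.0, 0.5))    # 2.0 - 0.5 > 0, so +1
print(h_minus(2.0, 0.5))   # -2.0 - 0.5 < 0, so -1
```

Each of these is a one-dimensional threshold test (a "decision stump"): about the simplest classifier imaginable, which is exactly the point.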

Prescription for calculating \Theta_d^{+} and \Theta_d^{-}

The possible values that \Theta_d^{\pm} can take are determined by the following simple prescription.

  1. For every dimension d:
  2. Make a sorted list of all S values x_d for all elements \vec{x} in the training set, including both negative and positive examples.
  3. The possible values of \Theta_d^{\pm} are then the midpoints between consecutive entries of this list that have different values.

Here is some Matlab code that does this, assuming you’ve created the bluepositives and rednegatives arrays from the previous post.



% Determines the possible $\Theta$ values.

for d=1:M

    datalist(:,d)=cat(1, bluepositives(:,d), rednegatives(:,d));
    sorteddatalist(:,d)=sort(datalist(:,d));

    for k=1:S-1
        possiblethetas(k,d)=(sorteddatalist(k+1,d)+sorteddatalist(k,d))/2;
    end;

end;



The possible \Theta_d^{\pm} values are now stored in the possiblethetas array.

Now that we have a set of possible values, we need to choose one. We do this using the following prescription.

  1. For every dimension d:
  2. For every possible value of \Theta_d^{\pm} (there are S-1 of these):
  3. Compute the classification error of weak classifier h_d^{+} assuming \Theta_d^{+}=possiblethetas(k,d)
  4. Compute the classification error of weak classifier h_d^{-} assuming \Theta_d^{-}=possiblethetas(k,d)
  5. Select the value of \Theta_d^{+} that gives the lowest classification error
  6. Select the value of \Theta_d^{-} that gives the lowest classification error

So we try out a total of 2*M*(S-1) possible weak classifiers and from these select 2*M of them.

Here is Matlab code that does this.



% Code for finding the Theta parameters that minimize classification error.
% (bluemisclasscost and redmisclasscost hold the per-point misclassification costs.)

for d=1:M

    for k=1:S-1     % look at each possible theta value

        errplus(k)=0; errminus(k)=0;

        for p=1:S/2 % look at each data point to see if it's classified correctly

            %   Start with blue points. For these we get a classification error
            %   if the weak classifier guesses negative.

            hp=sign(bluepositives(p,d)-possiblethetas(k,d));
            hm=sign(-bluepositives(p,d)-possiblethetas(k,d));
            if (hp<0) errplus(k)=errplus(k)+bluemisclasscost(p); end;
            if (hm<0) errminus(k)=errminus(k)+bluemisclasscost(p); end;

            %   Same with red points. For these we get a classification error
            %   if the weak classifier guesses positive.

            hp=sign(rednegatives(p,d)-possiblethetas(k,d));
            hm=sign(-rednegatives(p,d)-possiblethetas(k,d));
            if (hp>0) errplus(k)=errplus(k)+redmisclasscost(p); end;
            if (hm>0) errminus(k)=errminus(k)+redmisclasscost(p); end;

        end;

    end;

    [minerr,ix]=min(errplus);  thetaplus(d)=possiblethetas(ix,d);
    [minerr,ix]=min(errminus); thetaminus(d)=possiblethetas(ix,d);

end;



After this the desired values of \Theta_d^{\pm} will be stored in the thetaplus and thetaminus arrays. Here is a two-dimensional (M=2) example showing \Theta_1^{\pm} and \Theta_2^{\pm}.
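For readers without Matlab, here is a rough pure-Python sketch of the same two-step prescription for a single dimension. The function and variable names are my own, this version skips midpoints between equal values (as the prose prescribes), and sign(0) edge cases are glossed over:

```python
def best_thetas(pos, neg, pos_cost, neg_cost):
    """Return (theta_plus, theta_minus) minimizing the weighted training error
    of the stumps h^+ = sign(x - theta) and h^- = sign(-x - theta)."""
    # Step 1: candidate thetas are midpoints between distinct sorted values.
    values = sorted(pos + neg)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:]) if a != b]

    best_p = best_m = None
    err_p_min = err_m_min = float("inf")
    for theta in candidates:
        # Positive (blue) points: error when the stump guesses negative.
        err_p = sum(c for x, c in zip(pos, pos_cost) if x - theta < 0)
        err_m = sum(c for x, c in zip(pos, pos_cost) if -x - theta < 0)
        # Negative (red) points: error when the stump guesses positive.
        err_p += sum(c for x, c in zip(neg, neg_cost) if x - theta > 0)
        err_m += sum(c for x, c in zip(neg, neg_cost) if -x - theta > 0)
        if err_p < err_p_min:
            err_p_min, best_p = err_p, theta
        if err_m < err_m_min:
            err_m_min, best_m = err_m, theta
    return best_p, best_m

# Positives cluster above 1, negatives at or below 0, so h^+ splits at 0.5:
pos, neg = [1.0, 2.0, 3.0], [-1.0, -2.0, 0.0]
tp, tm = best_thetas(pos, neg, [1, 1, 1], [1, 1, 1])
print(tp)   # 0.5, the midpoint between 0.0 and 1.0
```

Running `best_thetas` once per dimension d reproduces the brute-force search over all 2*M*(S-1) candidate weak classifiers described above.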


