Trying out GloboBoost (we have to find a better name, any suggestions?) on a benchmark dataset

The UC Irvine Machine Learning Repository is a wonderful source of benchmark data sets for trying out machine learning algorithms, and the site itself is easy to navigate and browse.

From the data sets on this site, I wanted one that had a small number of dimensions, so I picked the Haberman Survival data set to try.

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

The class attribute (the thing we’re trying to predict) is:

Survival status (class attribute)
— 1 = the patient survived 5 years or longer (positive examples)
— 2 = the patient died within 5 years (negative examples)

This data set has three dimensions of data:

  1. Age of patient at time of operation (numerical)
  2. Patient’s year of operation (year – 1900, numerical)
  3. Number of positive axillary nodes detected (numerical)

From these three features, the classifier tries to predict the survival status of the patients who received the surgery.
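Since the post doesn't show any loading code, here is one possible way to read the data. This is a sketch, assuming a local copy of the UCI file (plain CSV, no header, columns in the order listed above); the `load_haberman` helper and the sample rows are my own illustration, not part of the original code.

```python
import csv
from io import StringIO

# Illustrative rows in the Haberman format: age, year of operation
# (minus 1900), positive axillary nodes, class (1 = survived 5+ years,
# 2 = died within 5 years). Not necessarily actual rows from the file.
SAMPLE = """30,64,1,1
30,62,3,1
34,59,0,2
34,66,9,2
"""

def load_haberman(file_obj):
    """Parse rows into (features, label) pairs, mapping the class
    label to +1 (positive/survived) or -1 (negative/died), which is
    the usual convention for boosting-style classifiers."""
    data = []
    for age, year, nodes, cls in csv.reader(file_obj):
        x = (float(age), float(year), float(nodes))
        y = +1 if cls == "1" else -1
        data.append((x, y))
    return data

points = load_haberman(StringIO(SAMPLE))
print(points[0])  # ((30.0, 64.0, 1.0), 1)
```

To read the real file, you would pass an open file handle instead of the `StringIO` sample.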

Here is a scatter plot of the data:

[Figure: Haberman_data — scatter plot of the data set]

Here is another view, with each data point labeled as positive (blue) or negative (red).

[Figure: Haberman_data_sorted — scatter plot with positive points in blue and negative points in red]

So to see how well GloboBoost does, we first have to modify the code a bit to accept training data with unequal numbers of positive and negative examples: this data set has 225 positive and 81 negative examples in total. We'll randomly assign half of the points to the training set and the other half to the test set.
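The random half split described above can be sketched like this. The `half_split` helper and the toy data are my own illustration, not the post's actual code; it assumes the data is a list of (features, label) pairs.

```python
import random

def half_split(data, seed=0):
    """Shuffle and split into equal-sized training and test halves.
    With unequal class counts (225 positive vs. 81 negative here), a
    plain random split preserves the class ratio only approximately,
    but each half sees examples of both classes."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = data[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Toy example: 6 positive and 2 negative points.
data = [((i, 0, 0), +1) for i in range(6)] + [((i, 0, 0), -1) for i in range(2)]
train, test = half_split(data)
print(len(train), len(test))  # 4 4
```

A stratified split (sampling half of each class separately) would keep the 225:81 ratio exact in both halves, at the cost of slightly more code.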

Hmm, interesting. It turns out that this data set is not linearly separable. What this means here is that with only linear weak classifiers (as we have in the current implementation), the theoretical best you can do is the same as always guessing positive, i.e. 225/306 ≈ 73.5%. I ran my modified GloboBoost code on the Haberman data set and it returned exactly this theoretical best accuracy. That's good as far as it goes, but a classifier that can never beat always guessing positive, regardless of the input data, is not so good! This is a symptom of our currently limited dictionary of (first-order) weak classifiers. We'll have to come back to this data set once we have a better dictionary. Let's try a different data set in the next post.
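As a quick sanity check on the 73.5% figure quoted above, the always-guess-positive baseline is just the fraction of positive examples in the data set:

```python
# Majority-class baseline: always predict "survived" (the positive class).
# With 225 positive and 81 negative examples, accuracy is 225/306.
positives, negatives = 225, 81
baseline_accuracy = positives / (positives + negatives)
print(round(baseline_accuracy * 100, 1))  # 73.5
```

Any classifier worth keeping has to beat this number on held-out data.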

2 thoughts on “Trying out GloboBoost (we have to find a better name, any suggestions?) on a benchmark dataset”

  1. If it’s a Global Optimization Booster I think you should go with GoBoost. It’s short and punchy. And it is reminiscent of a space shuttle launch…

  2. A really feeble data set for a really complicated problem. I am not surprised you did not find any correlation. Perhaps as an exercise for the program it may have been useful, but I would have been very skeptical if you had found any correlation between the elements of this data set. I mean, all you have to do is look at the graph and you can see that the results are completely random (assuming you had a 3-D view you could turn and view from any angle).
