The UC Irvine Machine Learning Repository is a wonderful source of benchmark data sets for trying out machine learning algorithms. It is a really great site, easy to navigate and browse through.
From the data sets on this site, I wanted one that had a small number of dimensions, so I picked the Haberman Survival data set to try.
The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
The class attribute (the thing we’re trying to predict) is the survival status:
- 1 = the patient survived 5 years or longer (positive examples)
- 2 = the patient died within 5 years (negative examples)
This data set has three dimensions of data:
- Age of patient at time of operation (numerical)
- Patient’s year of operation (year – 1900, numerical)
- Number of positive axillary nodes detected (numerical)
From these the classifier tries to predict the survival status of the people receiving the surgery.
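To make the layout concrete, here is a minimal sketch of reading rows in the data file's comma-separated format (age, year, nodes, class) and mapping the class to +1/-1 labels. The function name `parse_haberman` and the three sample rows are illustrative assumptions, not part of GloboBoost or of the actual file contents.

```python
import csv
import io

# Illustrative rows only, in the layout: age, year - 1900, positive nodes, class.
SAMPLE = """30,64,1,1
34,59,0,2
38,69,21,2
"""

def parse_haberman(text):
    """Return (features, labels); labels are +1 (class 1) or -1 (class 2)."""
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        age, year, nodes, status = (int(v) for v in row)
        features.append((age, year, nodes))
        labels.append(1 if status == 1 else -1)
    return features, labels

features, labels = parse_haberman(SAMPLE)
print(features[0], labels)
```

With the real file, the same function could be fed the file's text instead of `SAMPLE`.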
Here is a scatter plot of the data:
Here is another view showing each data point labeled positive (blue) and negative (red).
To see how well GloboBoost does, we first have to modify the code a bit to accept training data with unequal numbers of positive and negative examples. Here the data set has 225 positives and 81 negatives, 306 points in total. What we’ll do is randomly pick half of the points to be the training set and use the other half as the testing set.
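The random half/half split can be sketched roughly as follows. This is not the actual GloboBoost code; `split_half` and the fixed seed are assumptions made for illustration, and the label list is a stand-in with the same class counts as the data set.

```python
import random

def split_half(n_points, seed=0):
    """Shuffle the indices 0..n_points-1 and cut them into two halves."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    idx = list(range(n_points))
    rng.shuffle(idx)
    half = n_points // 2
    return idx[:half], idx[half:]

# Stand-in labels with the class counts above: 225 positives, 81 negatives.
labels = [1] * 225 + [-1] * 81
train_idx, test_idx = split_half(len(labels))

# The split is not stratified, so the class mix in each half varies with the seed.
n_pos_train = sum(1 for i in train_idx if labels[i] == 1)
print(len(train_idx), len(test_idx), n_pos_train)
```

Because the classes are unbalanced, a plain random split like this one does not guarantee the same positive/negative ratio in both halves; a stratified split would be the alternative if that mattered.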
Hmm, interesting. It turns out that this data set is not linearly separable. What this means is that with only linear weak classifiers (as we have in the current implementation), the theoretical best that you can do is the same as always guessing positive, i.e. 225/306 ≈ 73.5%. I ran my modified GloboBoost code on the Haberman data set and it returned the theoretical best classifier accuracy, which is good, but not being able to do better than always guessing positive regardless of the input data is not so good! This is a symptom of our currently limited dictionary of (first order) weak classifiers. We’ll have to come back to this one when we have a better dictionary! Let’s try a different data set in the next post.
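For reference, the 73.5% figure is simply the majority-class rate on the full data set. A quick check of that arithmetic (the helper name `majority_baseline_accuracy` is just for illustration):

```python
def majority_baseline_accuracy(n_pos, n_neg):
    """Accuracy of always predicting the more common class."""
    return max(n_pos, n_neg) / (n_pos + n_neg)

acc = majority_baseline_accuracy(225, 81)
print(f"{acc:.1%}")  # prints 73.5%
```

On a random half used for testing, the baseline will drift a little from this value depending on how many positives land in that half.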