Imagine you are given the task of creating code that will input some (potentially very complex) object, and automatically label it with one of two possible labels. Here are some examples:
- Given an image, is there an apple in the image or not?
- Given a chest x-ray, is there a tumor in the image or not?
- Is a piece of code malware or not?
- Given a movie review, is it positive or negative?
- Given only a person’s first name, is the person more likely to be male or female?
How can we go about building the software that will apply these labels? There are many ways that you could try to do this. One of these, which we’ll focus on here, is a basic machine learning approach called supervised binary classification.
This approach requires you to prepare two (and only two!) things before you can implement a classifier. These are labeled data and a set of weak classifiers. In this post I’ll focus on what labeled data is, how to get it ready for our procedure, and places where you can find very cool curated data sets.
If we want a machine to be able to learn to recognize something, one way to proceed is to present the machine with large numbers of objects that have been labeled (usually by humans) to either include the thing we are looking for or not, together with a prescription allowing the machine to learn what features in the object correlate with these labels.
For example, if we want to build code to detect whether an apple is in an image, we could generate a large set of images, all with apples in them, and label these images with the word “apple”. We could also generate a large set of images, none of which had apples in them, and label each of these “no apple”. A good machine learning algorithm might then detect a correlation between having red in an image, and having an “apple” label. The algorithm could learn that images that contain red things are more likely to contain apples, based on the examples we’ve shown it.
This type of approach, which depends on having large numbers of properly labeled examples, is called supervised machine learning.
The notation we will use is that the potentially very complex object, to which the label is assigned, is denoted x . x can be any representation of the object that you feel makes sense, given the type of object you are dealing with. Sometimes it makes sense to think of x as a vector of real numbers. For example, if you are dealing with images, x could be the values of the pixels, or the values of color histograms, or the coefficients of some type of wavelet transform. If you are dealing with natural language, as we will be in the example we’ll build here, x is often a string.
The label that has been assigned to that object we will denote y . We will use the convention that , with these two different labels referring to the two different possibilities we are looking for our binary classifier to select.
Every item in our labeled data will be a pair of the form (x, y). The total number of items we have access to we’ll call N . If we want to build a system that learns to classify the most likely gender of a person‘s first name, labeled data could look like this:
In this example the x variables are strings holding a name, and the y values hold the most likely gender for that name, encoded so that +1 means “female” and -1 means “male”.
Note that while this example looks fairly simple, in general you can put anything you like in the x slot. If you were building a binary classifier that labeled a novel as to whether it was written by Dean Koontz or not, you might put the entire text of the novel in this slot. If you were interested in labeling images, x could be the pixel values in an image.
Once we have access to this labeled data, we need to perform a set of simple operations on it in order to proceed. These operations separate the available data into three separate groups, which are called the training set, validation set and test set.
The training set is the set of data that you will use to build your classifier; it is the subset of your data that the algorithm uses to learn the difference between objects labeled +1 and objects labeled -1.
The validation set is the set of data that you will use to gauge the performance of a potential classifier, while your training algorithm is running.
The test set is the set of data that you will use to test how well the best classifier you found could be expected to perform “in real life”, on examples it hasn’t seen yet.
There are procedures that have been developed for how to best slice your available labeled data into these categories. In the example we’ll implement here, we won’t use these (but if you’re coming at this with a machine learning background, you should!). If you are going to build industrial strength classifiers you will need to take some care in how you segment your available data, and you will encounter issues that need some thought to resolve. If you would like me to post anything about these let me know in the comments section.
Because we‘re focusing on showing the basic principles here, we‘ll just use a very simple procedure for segmenting our data. Here is how it works:
The first two steps ensure that the data we are keeping has equal numbers of +1 and -1 examples. Note that for this to work exactly as laid out K has to be even and R has to be divisible by four. Segmenting everything exactly like this isn‘t necessary. You are free to choose the sizes of your test, training and validation sets to be anything you like.
In our implementation, K was 2,942 (this was the number of male names in the NLTK “names” corpus, which is considerably smaller than the number of female names), and R was chosen to be 100. For these choices, the training set contains S = 2,892 elements (half labeled +1 and half labeled -1), the validation set contains V = 2,892 elements (half labeled +1 and half labeled -1), and the test set contains R = 100 elements (half labeled +1 and half labeled -1).
Here are links to a Python function that implements Algorithm 1 (note: I had to change the file extension to .doc as wordpress doesn’t like .py. Change the extension to .py) as well as the training_set, validation_set and test_set (all in cPickled form — note again change extension to .txt, wordpress doesn’t like .txt either?) we’re going to use to build our classifier.
Question:: anyone have a better way of sharing source code on wordpress?
Try it! At this point it makes sense to go over the source code for Algorithm 1 linked to above, and understand what it’s doing. If your Python is rusty, or you’ve never used NLTK, this is a good place to get your mojo going. Ask me questions if something doesn’t work!!!
Finding Your Own Labeled Data
The example we’re going to build here, which builds a classifier that labels a person’s name with “male” or “female”, is fun, easy and will give you a good idea for how the quantum binary classification procedure works. But, unless you are obsessed with natural language like I am, you might find it a little too “toy”. You might want to build a very different binary classifier! You can — anything you can think of that would benefit from automatically applying a binary label can work. If you’d like to follow along with these posts, using a totally different dataset, that would be great! I can try to help you if you get stuck.
The best place to start looking for publicly available datasets is the UC Irvine machine learning repository. There is a lot of very cool stuff there.
If you know of any other places where there are curated datasets that can be used for machine learning, please link to them in the comments section!