# Learning to program the D-Wave One: building a dictionary of weak classifiers

Hello everyone! In my last blog post I introduced the concept of labeled data. Now I’m going to talk about the concept of weak classifiers.

What’s a weak classifier?

A weak classifier is a function $F(x)$ that is applied to the objects $x$ in your labeled data and returns either +1 or -1 (in other words, a yes or no on the thing you’re trying to classify). Weak classifiers are sometimes referred to in machine learning as features. You need to provide these to the quantum binary classification learning algorithm we’re (slowly :-)) implementing here (there are more sophisticated machine learning methods that can learn features also; we will encounter these in later modules).

These functions can be as simple or as complex as you like; the only restriction on them is that they have to take objects of type $x$ as input and output +1 or -1. You can think of the returned number as a “guess” that the weak classifier is making as to the correct label for the object. We will denote the kth weak classifier as $F_k(x)$, where $k$ is an index that runs from 1 to $D$, and $D$ is the total number of weak classifiers we wish to use. We will refer to the set of all $D$ weak classifiers as the dictionary of weak classifiers.
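In code, a weak classifier is nothing more than a function from an input object to ±1. Here is a minimal sketch in Python, assuming (as we will below) that each object $x$ is a name stored as a string; the function name `ends_in_a` is just an illustrative choice, not part of the dictionary we build later:

```python
# A weak classifier maps an input object x to +1 or -1.
# Here x is assumed to be a name stored as a string.

def ends_in_a(x):
    """Guess +1 ('female') if the name ends in 'a', else -1 ('male')."""
    return +1 if x.lower().endswith('a') else -1

# A dictionary is simply a list of D such functions.
dictionary = [ends_in_a]
```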

Selecting weak classifiers that “guess” well can be very hard, especially if you don’t have good intuition for what structures in your data are correlated with your labels. Getting a good set often involves trial and error and experimentation. Often a particular type of input object will have a lot of established prior art in what types of weak classifiers should be good starting points. Sometimes you will have to guess at what a good set might look like.

For our gender classifier, we want to set up the functions $F_k(x)$ so that they make guesses at what gender the name stored in the $x$ variable might be. One type of feature that might work is the last letter of the name – names ending in ‘a’ and ‘e’ might be more likely to be female, while those ending in ‘n’ might be more likely to be male.

If we go with this type of set, we can define a total of $D=52$ weak classifiers as follows (Python source that implements this is included at the end of the post):
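One way to build such a dictionary is to create two weak classifiers for each of the 26 letters of the alphabet: one that guesses ‘female’ when the name ends in that letter, and one that guesses ‘male’, for $D = 52$ in total. The sketch below follows that construction and is consistent with $F_9(x)$ testing for a final ‘e’; the helper name `make_classifier` and the exact ordering of the list are my assumptions, not necessarily the post’s own listing:

```python
import string

def make_classifier(letter, sign):
    """Return a weak classifier that outputs `sign` when the name's last
    letter matches `letter`, and `-sign` otherwise
    (+1 = guess 'female', -1 = guess 'male')."""
    def F(x):
        return sign if x.lower()[-1] == letter else -sign
    return F

# Two classifiers per letter of the alphabet gives D = 52.
dictionary = []
for letter in string.ascii_lowercase:
    dictionary.append(make_classifier(letter, +1))
    dictionary.append(make_classifier(letter, -1))

# With this ordering, F_9 (index 8, zero-based) is the classifier that
# guesses 'female' when the last letter is 'e'.
F9 = dictionary[8]
```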

For example, the function $F_9(x)$ outputs +1 (a guess that the name has gender ‘female’) if the last letter of the name is ‘e’, and -1 (a guess that the name should be classified as ‘male’) if it’s not. Recalling our earlier data, this is how $F_9(x)$ would guess on the following names:

| Name | $F_9(x)$ guess | Actual label |
| --- | --- | --- |
| Geordie | +1 (‘female’) | ‘male’ |
| Mohammad | -1 (‘male’) | ‘male’ |
| Suzanne | +1 (‘female’) | ‘female’ |
| Erin | -1 (‘male’) | ‘female’ |

This classifier got two right (‘Mohammad’ is actually ‘male’ and ‘Suzanne’ is actually ‘female’) and two wrong (‘Geordie’ is not ‘female’ and ‘Erin’ is not ‘male’).
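Scoring a weak classifier against labeled data like this takes only a few lines. Here is a minimal sketch; the helper name `accuracy` and the four-name mini data set are illustrative assumptions based on the example above:

```python
def F9(x):
    """Guess +1 ('female') if the name ends in 'e', else -1 ('male')."""
    return +1 if x.lower().endswith('e') else -1

def accuracy(F, labeled_data):
    """Fraction of (name, label) pairs that classifier F guesses correctly."""
    correct = sum(1 for x, y in labeled_data if F(x) == y)
    return correct / len(labeled_data)

# The four names above, labeled with +1 = 'female', -1 = 'male':
data = [('Geordie', -1), ('Mohammad', -1), ('Suzanne', +1), ('Erin', +1)]
print(accuracy(F9, data))  # two right out of four, i.e. 0.5
```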

Note that you can use as many weak classifiers as you like in your dictionary. For example, it’s straightforward to extend the dictionary above with weak classifiers that check for combinations of the last two letters in a name, the length of a name, the first letter of a name, and many other possibilities. You are limited here only by your creativity.
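Each of those ideas is easy to express as a classifier “factory”. The sketch below is illustrative only; the helper names, and the particular suffixes, thresholds, and letters, are my assumptions rather than part of the post’s dictionary:

```python
def make_suffix_classifier(suffix, sign):
    """Guess `sign` if the name ends in `suffix`, else `-sign`."""
    def F(x):
        return sign if x.lower().endswith(suffix) else -sign
    return F

def make_length_classifier(n, sign):
    """Guess `sign` if the name is longer than n letters, else `-sign`."""
    def F(x):
        return sign if len(x) > n else -sign
    return F

def make_first_letter_classifier(letter, sign):
    """Guess `sign` if the name starts with `letter`, else `-sign`."""
    def F(x):
        return sign if x.lower()[0] == letter else -sign
    return F

# A few extra classifiers to append to the original 52:
extras = [
    make_suffix_classifier('ne', +1),       # last two letters 'ne'
    make_length_classifier(6, +1),          # longer than six letters
    make_first_letter_classifier('j', -1),  # first letter 'j'
]
```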

A related issue to consider is that the data you have might not be in the best format for building a good classifier. For example, we’ve assumed that there is enough structure in the letters used in a proper name to make a good guess at the name’s gender. Whether this is in fact the case needs to be empirically tested. Perhaps there is a different way to store the data representing your objects – for example, it might be possible that the phonetic pronunciation of a name gives more clues as to its likely gender than its spelling. If this were the case, it might be better to store each of your names as sets of phonemes instead of letters, and build a dictionary of weak classifiers that look for phonetic properties instead of the presence or absence of specific letters.

We will see that using the very simple ‘last letter’ dictionary we built in Algorithm 2 is enough to build a classifier that labels names in a typical test set with about 75% accuracy. It is likely that you can do better than this with some careful thought and insight into your choice of encoding of your data and your choice of weak classifiers.

Now as I mentioned in the previous post, I recognize that for some of you, building a classifier that labels names with their most likely gender may not exactly turn your crank. That’s OK! If you have a different data set you’d like to work with (perhaps you found a neat one at the UC Irvine machine learning repository), you have an interesting assignment at this stage: come up with a dictionary of weak classifiers $F_k(x)$ that constitutes your best guess at a good way to distinguish the quantity you’re trying to classify.