# Learning to program the D-Wave One: loss function, regularization and some notation

The Loss Function

A loss function is a measure of how well a choice for a set of weak classifiers performs. Let’s start by quantifying how we encode a particular combination of weak classifiers. Given a dictionary with D weak classifiers, define a D-dimensional vector of binary numbers $\vec{w}$.

The elements of $\vec{w}$ that have value 1 represent weak classifiers that are ‘turned on’, and the elements that have value 0 represent weak classifiers that are ‘turned off’. For the choice shown in the figure in the previous post, the vector $\vec{w}$ would have zeroes for all elements except 2, 4, 14, 33, 39 and 45, whose values would be 1. Here are three examples, for the D=52 dictionary defined in Algorithm 2:

$\vec{w} = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]$

This would represent the choice where no weak classifiers were included.

$\vec{w} = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]$

This would represent the choice where all 52 weak classifiers were included.

$\vec{w} = [0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0]$

This would represent the choice where weak classifiers 2, 4, 14, 33, 39 and 45 were included, and all the others were turned off.

For any item $x_s$ in our training set, each choice for $\vec{w}$ will produce a prediction for what the label should be. We can write this prediction as ${\cal F}(x_s) = sign \left[ \sum_{j=1}^D w_j F_j(x_s) \right]$ . Since all of our training data comes with labels $y_s$, we can compare this prediction to the actual label. The way we will do this is that if the prediction is correct, we want the function that compares them to return zero, and if the prediction is wrong, we want it to return 1. If we then sum over all the elements in the training data, we get

$L_0(\vec{w}) = {{1}\over{4}} \sum_{s=1}^S \left( sign \left[ \sum_{j=1}^D w_j F_j(x_s)\right] -y_s \right)^2$

The function $L_0(\vec{w})$  is called the zero-one loss and simply counts the number of errors that a particular choice for $\vec{w}$ made – if it got every single member of the training set right,  $L_0(\vec{w})$ will be zero. If it got them all wrong, $L_0(\vec{w})$ will be equal to S, the total number of elements in the training set.

In the QBC algorithm, it will be necessary to only include terms in the loss function that are either linear or quadratic in the $\vec{w}$  variables, because that’s the format that D-Wave hardware natively accepts. Because the $L_0(\vec{w})$ function has a ‘sign’ function in it, we won’t be able to use this exact functional form. Instead, we will use a related form called the quadratic loss.

The quadratic loss has the form

$L_1(\vec{w}) = \sum_{s=1}^S \left[ {{1}\over{D}} \sum_{j=1}^D w_j F_j(x_s) - y_s \right]^2$

Comparing this to the previous loss function, we see that the main difference is that we have removed the ‘sign’ function and replaced it with a normalized sum.

Having to convert the loss function from ‘the one we really want’ to ‘one the hardware can handle’ is not ideal, and is one of the drawbacks of the QBC algorithm. However it turns out that the quadratic loss performs well for building good classifiers in many circumstances. In later modules we’ll introduce procedures for expanding the kinds of optimization problems we can run in hardware, but for now we’ll just use the quadratic loss.

Regularization

When building a classifier, it is desirable to have as few weak classifiers included as possible – you want your strong classifier to be as simple as you can make it, while retaining some desired level of performance.

Overly complicated strong classifiers will tend to perform better on your validation set, but more poorly on your test set (and in the real world on examples it hasn’t seen yet). This happens because of a phenomenon known as over-fitting – a strong classifier formed using a large number of weak classifiers can adjust itself to the idiosyncrasies of your training set, and not generalize as well to as yet unseen examples as a simpler model.

You can think of over-fitting as being analogous to “memorization” of the training set features, without “understanding” what it is about these features that is behind the classification. A student who crams for an exam at the last moment by memorizing a large number of specific facts may very do quite well on the exam, because they have the specific facts they need in that situation memorized, but will have trouble applying the underlying concepts to other situations.

Here we will use what is known as zero-norm regularization to fight against over-fitting. This penalizes the inclusion of many weak classifiers.

The regularization term we will use is $R(\vec{w}) = \lambda \sum_{j=1}^D w_j$, where $\lambda$ is an as yet unspecified real number.

Ranking Performance: The Optimization Objective Function

Now that we have a loss function and a regularization term, we’re ready to define a procedure for ranking the performance of combinations of weak classifiers. Here we’ll try to introduce the basic concepts. We’ll formalize the procedure in a later post.

In order to find the best possible set of weak classifiers to form our strong classifier, we construct a function that is the sum of the loss function and the regularization term:

$E(\vec{w}) = L_1(\vec{w}) + R(\vec{w}) = \sum_{s=1}^S \left[ {{1}\over{D}} \sum_{j=1}^D w_j F_j(x_s) - y_s \right]^2 + \lambda \sum_{j=1}^D w_j$

We’ll refer to $E(\vec{w})$ as the optimization objective function. For each choice of $\vec{w}$,  $E(\vec{w})$ returns a number. The first term in  $E(\vec{w})$ — the loss function term — returns numbers that are lower the closer $\vec{w}$ gets to correctly labeling all elements of the training set. The second term in  $E(\vec{w})$ — the regularization term — just counts the number of ones in $\vec{w}$ , multiplied by some number $\lambda$.

Now let’s see how we can use $E(\vec{w})$ to do something useful. First, let’s find the settings of $\vec{w}$ that return the lowest possible value of $E(\vec{w})$ for some particular choice of $\lambda$ (say $\lambda=0$). We write this as

$\vec{w}^* = arg min_{\vec{w}} E(\vec{w}, \lambda = 0)$

The ‘arg min’ notation simply means ‘return the value of $\vec{w}$  that minimizes $E(\vec{w}, \lambda = 0)$’, and $\vec{w}^*$ is that value.

Alright so what is $\vec{w}^*$ ? It is just the list of weak classifiers that, when considered as a group, ‘perform best’ (meaning minimize the optimization objective function) for the choice of $\lambda$ we made. In the case where we set $\lambda=0$, what ‘performs best’ means is that the set of weak classifiers given by $\vec{w}^*$ has the lowest possible number of mismatches between predicted labels and actual labels for our training data.

We now have a possible candidate for our final strong classifier. Let’s call it $\vec{w}^* (\lambda = 0)$ to remember that it’s the result we found when using $\lambda = 0$.

If we repeat this process for many different values of $\lambda$ , the minimizer $\vec{w}^* (\lambda )$ will change. If we look at the limit where $\lambda$ is very large, the cost of including even a single weak classifier is prohibitive and $\vec{w}^*$ will become the vector of all zeroes.

So for every value we set $\lambda$ to, we get a candidate suggestion for a strong classifier. How can we select which of these might work best in the real world? The solutions returned for $\lambda \sim 0$ don’t take advantage of regularization, so you might expect these would suffer from over-fitting – the vector $\vec{w}^* (\lambda )$ would have too many ones in it (i.e. too many weak classifiers). The solutions returned for very large $\lambda$ will probably be too sparse to work well. Somewhere in between these will be the best choice.

In order to find what this best choice is, we now turn to our validation set. Recall that we partitioned our original data set into three groups – training, validation and test – and so far we’ve just used the training set.

For every value of $\lambda$ we tried, we got a solution $\vec{w}^* (\lambda )$ . We now rank these using the following procedure. For each solution $\vec{w}^* (\lambda )$  we compute the validation error

$E_v(\lambda) = {{1}\over{4}} \sum_{v=1}^V \left[ sign \left[ \sum_{j=1}^D w_j^*(\lambda) F_j(x_v) \right] - y_s \right]^2$

The validation error simply counts the number of elements of the validation set a strong classifier candidate mislabels. We can then rank the vectors $\vec{w}^* (\lambda )$ from lowest validation error (i.e. perform the best on the validation set) to highest validation error (perform the worst on the validation set).

We’d expect that $\vec{w}^* (\lambda )$ where $\lambda$ is either too high or too low would give higher validation errors, and somewhere in the middle is the lowest one. That’s the vector we want to output.

Once we have the value of  $\vec{w}^* (\lambda)$ that produces the smallest validation error, we’re done! That value encodes the final strong classifier that is the output of the QBC algorithm. It is the best possible strong classifier we could have made, given our choices for training and validation data and our dictionary of weak classifiers.

Notation: Massaging the Optimization Objective Function

In what follows it will be useful to have a couple of different ways of writing down our optimization objective function. In particular when we introduce the hardware, we usually write the form of the optimization problem in a different way that’s closer to the way we specify the machine language parameters of a chip.

The form we introduced in the previous section was

$E(\vec{w}) = \sum_{s=1}^S \left[ {{1}\over{D}} \sum_{j=1}^D w_j F_j(x_s) - y_s \right]^2 + \lambda \sum_{j=1}^D w_j$

One thing we should do here is expand out the squared term in brackets and throw away all of the terms that aren’t functions of $\vec{w}$ (these constants don’t affect the returned optimal value $\vec{w}^* (\lambda)$  so we can ignore them). When we do this we can write

$E(\vec{w}) = \sum_{j=1}^D Q_{jj} w_j + \sum_{i

where $Q_{jj} = {{D \lambda}\over{2S}} + {{1}\over{2D}} -{{1}\over{S}} \sum_{s=1}^S y_s F_j(x_s)$ and $Q_{ij} = {{1}\over{DS}} \sum_{s=1}^S F_i(x_s) F_j(x_s)$.  Note that we multiplied all terms by a constant scaling factor  ${{D}\over{2S}}$ just so that it’s easier to keep track of the ranges of the terms in the objective function.

We are going to refer to this way of writing the optimization objective function as the QUBO representation. QUBO stands for Quadratic Unconstrained Binary Optimization.

You should convince yourself that multiplying out the original optimization objective function and collecting terms gives the expression above.

For the purposes of coding up the QBC algorithm, the QUBO representation is all we will need. However as we’ll see in the next post, a slightly different way of looking at the problem is more natural for understanding what the hardware is doing.

If we make a change of variables from 0/1 binary variables $\vec{w}$  to -1/+1 ‘spin’ variables $\vec{z}$  by substituting $w_j = {{1+z_j}\over{2}}$  into our QUBO representation, we can write our optimization objective function as

$E(\vec{z}) = \sum_{j=1}^D Q_{jj} \left[ {{1+z_j}\over{2}} \right] + \sum_{i

If we expand everything out, collect terms and throw away everything that doesn’t depend on the new optimization variables, we get

$E(\vec{z}) = \sum_{j=1}^D h_j z_j + \sum_{i

where  $h_j = {{Q_{jj}}\over{2}} + \sum_{i=1}^D {{Q_{ij}}\over{4}}$ and  $J_{ij} = {{Q_{ij}}\over{4}}$. We’ll refer to this version as the Ising representation, and we’ll see why this is a useful way to look at the problem in the next post.

## 4 thoughts on “Learning to program the D-Wave One: loss function, regularization and some notation”

1. 水平太差看不懂,还得多学习。

2. Thanks for posting these notes – I look forward to the rest of the series.

BTW, I think there is a very minor typo in the equation for E_{v}(\lambda) near the end of ‘Ranking Performance’; the last variable y_s should be y_v.