IV. Tremble before the awesome power of GloboBoost

In the previous post I provided a dictionary of weak classifiers we can use. Now I want to introduce the problem we will use the hardware to help solve.

Different types of boosting use different techniques to determine the weights $w_j$ for the weak classifiers $h_j$. Here we introduce a new technique for determining a set of $w_j$ that yields a strong classifier.

As I mentioned in the previous post, the first step is to restrict $w_j$ to be binary (0/1). After imposing this restriction we cast the problem of finding a desired set of them as an optimization problem. We want to minimize classification error on the training set in the presence of a regularization term. To do this we form an objective function with two parts.

The first is a quadratic loss function, designed to minimize training error. This loss function is the “Least Squares” type and is analyzed (along with several other loss functions) here. I found this reference very useful and recommend spending some time with it if you are new to machine learning. I am going to use the same notation here as in this paper.

Given an observed input vector $\vec{x}$ and an unobserved output value $y$, we wish to learn a predictor function $f(\vec{x})$ (this will be our strong classifier) such that $y \approx f(\vec{x})$. In other words, we want to find a way to predict an outcome $y$ given a vector $\vec{x}$. Pairs $(\vec{x}, y)$ are drawn from some underlying distribution $D$ that is problem dependent. As an example, we could define $D$ to be the set of pairs $(\vec{x}, y)$ where $\vec{x}$ is every possible 1080×720 black and white picture and $y=+1$ if the picture has a face in it and $y=-1$ if it doesn’t.

We will consider for now the case of binary classification ($y=\pm 1$) with the following prediction rule: predict $y=1$ if $f(\vec{x})\geq0$, and predict $y=-1$ otherwise.

Now, as you might have noticed, the domain $D$ can be ridiculously large. In the example above, just assuming each pixel can be black or white gives $2^{1080 \times 720} \gg 2^{1000}$ possible picture/label combinations. So even though what we really want is an $f(\vec{x})$ that minimizes error over the whole domain, in practice we have to sample from it.

So what do we want to sample? It turns out that minimizing a quantity called risk is equivalent to minimizing the classification error. Risk is defined to be

$Q = \left< \phi(f(X)Y) \right>_{X,Y}$

where $\phi$ is a one-variable convex function yet to be specified, and the expectation value is computed over the entire domain $D$. In practice $D$ is too big, so we sample from it $S$ times to compute a stochastic approximation:

$Q \sim {{1}\over{S}} \sum_{s=1}^S \phi \left( f(\vec{x}_s) y_s \right)$

In our case we choose the “Least Squares” function $\phi(v)=(1-v)^2$. With this choice we want to minimize

$Q \sim {{1}\over{S}} \sum_{s=1}^S \left( 1 - f ( \vec{x}_s ) y_s \right)^2$
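This stochastic approximation is easy to compute directly. Here is a minimal Python sketch (the function and variable names are illustrative, not from the Matlab code below):

```python
import numpy as np

def empirical_risk(f, xs, ys):
    """Stochastic approximation of the least-squares risk:
    Q ~ (1/S) * sum_s (1 - f(x_s) * y_s)^2 over S sampled pairs."""
    margins = np.array([f(x) for x in xs]) * np.array(ys)
    return np.mean((1.0 - margins) ** 2)

# A classifier that labels every sample correctly (f(x)*y = +1) has zero risk.
xs = [np.array([1.0]), np.array([-2.0])]
ys = [1, -1]
f = lambda x: np.sign(x[0])  # toy predictor: sign of the first coordinate
print(empirical_risk(f, xs, ys))  # 0.0
```

A classifier that gets every sample wrong ($f(\vec{x}_s) y_s = -1$) would score the maximum risk of 4 under this loss.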

In the previous post I defined what I want the $f(\vec{x})$ function to look like (a sum over weak classifiers). Inserting that definition gives

$Q(w_1, ..., w_N) = {{1}\over{S}} \sum_{s=1}^S \left( 1 - {{y_{s}}\over{N}} \sum_{j=1}^N w_j h_j \left( \vec{x}_{s} \right) \right)^2$
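To make the structure of $f$ concrete, here is a hedged Python sketch of the strong classifier as a vote over decision-stump weak classifiers of the kind described in the previous post (names are illustrative):

```python
import numpy as np

def strong_classify(x, weak_classifiers, w):
    """Strong classifier f(x) = (1/N) * sum_j w_j * h_j(x), where each
    binary weight w_j in {0, 1} switches a weak classifier on or off.
    Predict y = +1 if f(x) >= 0 and y = -1 otherwise."""
    N = len(weak_classifiers)
    f = sum(wj * h(x) for wj, h in zip(w, weak_classifiers)) / N
    return 1 if f >= 0 else -1

# Two decision stumps h_j(x) = sign(x[d] - theta), both switched on.
stumps = [lambda x: np.sign(x[0] - 0.5), lambda x: np.sign(x[1] - 0.5)]
print(strong_classify(np.array([1.0, 1.0]), stumps, [1, 1]))  # 1
```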

Define

$\Lambda_j \equiv {{1}\over{S}} \sum_{s=1}^S h_j(\vec{x}_s) y_s$

$\Lambda_{jk} \equiv {{1}\over{S}} \sum_{s=1}^S h_j(\vec{x}_s) h_k(\vec{x}_s)$
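These sample averages can be computed directly from the matrix of weak-classifier outputs on the training set. A Python sketch (illustrative names, assuming $H$ is the $S \times N$ matrix with entries $h_j(\vec{x}_s)$):

```python
import numpy as np

def lambda_terms(H, y):
    """Given H, an S-by-N matrix of weak-classifier outputs (entries +/-1),
    and y, a length-S vector of labels (+/-1), return:
      Lam[j]     = (1/S) * sum_s H[s, j] * y[s]
      Lam2[j, k] = (1/S) * sum_s H[s, j] * H[s, k]
    Every summand is +/-1, so both land in [-1, +1]."""
    S = H.shape[0]
    Lam = (H * y[:, None]).sum(axis=0) / S
    Lam2 = (H.T @ H) / S
    return Lam, Lam2

H = np.array([[1, -1], [1, 1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)
Lam, Lam2 = lambda_terms(H, y)
print(Lam)  # [1. 0.]  -- the first weak classifier agrees with y on every sample
```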

Both of these quantities lie in the range $[-1, +1]$. If we drop the constant term in $Q$, then minimizing $Q$ is equivalent to minimizing

$L(w_1, ..., w_N)=\sum_{j=1}^N w_j \left[ {{1}\over{N^2}} - {{2 \Lambda_j}\over{N}} \right] + \sum_{i<j}^N {{2 \Lambda_{ij}}\over{N^2}} w_i w_j$

This is the form we will use to represent the minimization of classification error.

The second term we will include is a zero-norm regularization term, designed to minimize the number of weak classifiers used and prevent overfitting:

$R(w_1, ..., w_N)=\lambda \sum_{j=1}^N w_j$

Here $\lambda$ is a regularization parameter which we will determine using a procedure I will describe shortly. $R$ grows with the number of “switched on” weak classifiers.

The desired settings of the weights $w_j$ are then given by

$w^{opt}=\arg\min_{\vec{w}} \left(L(\vec{w})+R(\vec{w}) \right)$

The objective function here is a quadratic unconstrained binary optimization (QUBO) problem:

$w^{opt}=\arg\min_{\vec{w}} \left[ \sum_{j=1}^N Q_{jj} w_j + \sum_{i<j}^N Q_{ij} w_i w_j \right]$

where

$Q_{jj}=\lambda +{{1}\over{N^2}} -{{2 \Lambda_j}\over{N}}$

$Q_{ij}={{2 \Lambda_{ij}}\over{N^2}}$
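A direct Python transcription of these coefficient formulas (illustrative, mirroring the definitions above):

```python
import numpy as np

def build_qubo(Lam, Lam2, lam):
    """Assemble the upper-triangular QUBO matrix for the binary weights:
       Q[j, j] = lam + 1/N^2 - 2*Lam[j]/N      (diagonal / linear terms)
       Q[i, j] = 2*Lam2[i, j]/N^2  for i < j   (off-diagonal / pair terms)"""
    N = len(Lam)
    Q = np.zeros((N, N))
    for j in range(N):
        Q[j, j] = lam + 1.0 / N**2 - 2.0 * Lam[j] / N
        for i in range(j):
            Q[i, j] = 2.0 * Lam2[i, j] / N**2
    return Q

# N = 2, perfectly correlated first classifier, no regularization.
Q = build_qubo(np.array([1.0, 0.0]), np.eye(2), 0.0)
```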

We can now map this to the native format of the hardware by switching from $0/1$ variables to $\pm 1$ variables. Making the substitution $w_j \rightarrow (z_j+1)/2$ and dropping the constant term gives an equivalent objective function

$E(z_1, ..., z_N)=\sum_{j=1}^N b^*_j z_j + \sum_{i < j}^N J^*_{ij} z_i z_j$

where

$b^*_j = {{1}\over{4}} \left[ \sum_{k=1}^j Q_{kj} + \sum_{k=j}^N Q_{jk}\right]$

$J^*_{ij} ={{Q_{ij}}\over{4}}$

Explicitly these are

$b^*_j = {{\lambda}\over{2}} + {{1}\over{2N^2}} - {{\Lambda_j}\over{N}} +{{1}\over{2N^2}} \sum_{k=1, k \neq j}^N \Lambda_{kj}$

$J^*_{ij}= {{\Lambda_{ij}}\over{2N^2}}$
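The conversion can be sketched in Python (illustrative names); the sanity check below confirms that the two objectives differ only by a constant, so they share the same minimizer:

```python
import numpy as np

def qubo_to_ising(Q):
    """Convert an upper-triangular QUBO over w in {0,1} to Ising form over
    z in {-1,+1} via w_j = (z_j + 1)/2, dropping the resulting constant:
       b[j]    = (1/4) * (sum_{k<=j} Q[k, j] + sum_{k>=j} Q[j, k])
       J[i, j] = Q[i, j] / 4   for i < j"""
    N = Q.shape[0]
    b = np.array([(Q[: j + 1, j].sum() + Q[j, j:].sum()) / 4.0 for j in range(N)])
    J = np.triu(Q, k=1) / 4.0
    return b, J

def qubo_energy(Q, w):
    # w_j^2 = w_j for binary w, so this single product covers the diagonal too
    return w @ np.triu(Q) @ w

def ising_energy(b, J, z):
    return b @ z + z @ J @ z
```

For any assignment, `qubo_energy(Q, w)` and `ising_energy(b, J, 2*w - 1)` differ by the same constant, which is why the constant can be safely dropped.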

We have now recast the problem of finding the weights $w_j$ in the form the hardware is designed to solve.

Once we’ve calculated the $b^*_j$ and $J^*_{ij}$ numbers, we perform one additional step: find the entry with the largest magnitude, and divide every entry by this magnitude. The new normalized coefficients $b_j$ and $J_{ij}$ will then be in the range $-1 \leq b_j, J_{ij} \leq +1$. Normalizing all coefficients in this way isn’t required if we use a software solver, but it is useful if we’re going to put the problem into hardware.
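The normalization step is a one-liner; a Python sketch (illustrative names):

```python
import numpy as np

def normalize(b, J):
    """Divide every coefficient by the largest magnitude among all of them,
    so the normalized b and J lie in [-1, +1]."""
    big = max(np.abs(b).max(), np.abs(J).max())
    return b / big, J / big

b, J = normalize(np.array([2.0, -4.0]), np.array([[0.0, 1.0], [0.0, 0.0]]))
print(b)  # [ 0.5 -1. ]
```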

After doing this we have the final form for our objective function, and the weights $w_j$ we are seeking are found by (1) finding the global minimum of this function and (2) converting back from spin to binary variables via $w_j = (z_j+1)/2$.
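For small $N$ the global minimum can be found by exhaustive enumeration, which is what the Matlab code at the end of this post does for small problems. A Python sketch of steps (1) and (2) (illustrative names):

```python
import itertools
import numpy as np

def solve_ising_brute(b, J):
    """Enumerate all 2^N spin configurations and return the one minimizing
    E(z) = sum_j b[j]*z[j] + sum_{i<j} J[i, j]*z[i]*z[j], then map spins
    back to binary weights via w_j = (z_j + 1)/2."""
    N = len(b)
    best_E, best_z = np.inf, None
    for bits in itertools.product([-1, 1], repeat=N):
        z = np.array(bits)
        E = b @ z + z @ np.triu(J, k=1) @ z
        if E < best_E:
            best_E, best_z = E, z
    return (best_z + 1) // 2, best_E

w, E = solve_ising_brute(np.array([1.0, -1.0]), np.zeros((2, 2)))
print(w)  # [0 1]  -- z = (-1, +1) minimizes b.z when J = 0
```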

Solving the problem with finite precision

One interesting “feature” of the hardware is that the $b_j$ and $J_{ij}$ can only be specified to a rather small number of values. Physically this arises because when we try to set the value of a magnetic flux applied to some inductive structure (say the main loop of an rf-SQUID flux qubit) to some specific value, the actual value set will not be the desired value, but the desired value plus or minus some spread. This spread arises from a host of factors, including for example 1/f magnetic flux noise coming from the materials from which the chip is made.

Therefore when specifying a problem for sending to hardware, we need to approximate the actual desired $b_j$ and $J_{ij}$ with finite precision versions. The way we do this is by specifying the desired bits of precision, and then rounding the actual values to the nearest allowed values.
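This rounding step can be sketched as follows, mirroring the scaling used in the Matlab code below (function name is illustrative):

```python
import numpy as np

def quantize(b, J, bits):
    """Approximate b and J with `bits` bits of precision: scale so the
    largest magnitude maps to 2^(bits-1) - 1 integer levels, then round
    each coefficient to the nearest level."""
    big = max(np.abs(b).max(), np.abs(J).max())
    levels = 2 ** (bits - 1) - 1
    return np.round(b / big * levels), np.round(J / big * levels)

bq, Jq = quantize(np.array([1.0, -0.3]), np.zeros((2, 2)), bits=3)
print(bq)  # [ 3. -1.]
```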

It is interesting to analyze how this approximation affects the quality of the eventual classifier. That it does affect it is not surprising, although how many bits of precision are required to do interesting things is still not entirely clear.

An Example

Let’s look at the situation where both the positive and negative example clouds have mean zero, i.e. the two clouds are maximally overlapping. We then fix the standard deviation of the positive examples to 1 and vary the standard deviation of the negative examples between 0.5 and 1.5. When the two standard deviations are equal, it should be impossible to tell the positives from the negatives, and our strong classifier should return the correct answer 50% of the time (i.e. it can’t do better than guessing). As the ratio of standard deviations moves away from 1, the classifier can do better.

Here is the output of a run of the full Matlab code, which I include at the end of this post. Note that I am using a free third-party toolset to generate the Gaussian distributions, which you can download from here. If you don’t want to use it, just generate your own distributions of positive and negative points and take out the synthetic data generating routine I’m using.

You can see that the classifier accuracy goes to 50%, as it should, when the clouds are indistinguishable, and that the accuracy grows away from this point. It might be surprising at first that the curves are asymmetric about this point, but there’s a good reason for it (if you take a look at the Matlab code there is a section for computing the Bayes error, which gives a limit on how well we can do given some set of training data; thanks Vasil for the code!). We also see that only 3 bits of precision is enough to specify these problems well enough to be indistinguishable from using the exact values.

Here is a figure showing the same thing over a bigger range of standard deviation ratios. One interesting thing about this figure is the worst-performing curve. There I set all the quadratic terms to zero ($J=0$) and allowed only 2 bits of precision on the linear terms. It still works OK.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This is code for building binary classifiers using the global
% optimization approach.
%
% Everything is done with brute force. Efficiency is for the weak. Man’s
% Game!!!
%
% This function takes as input the standard deviation of the cloud of
% negative examples (sdr), the bits of precision to which we specify the
% ISING version of the problem (BOP), and a flag to choose between QUBO
% version (quboflag=1) and ISING version (quboflag=0). It outputs the
% accuracy of the classifier (accfin) and the number of weak classifiers
% used (classsize).

function [accfin,classsize]=crunch(sdr,BOP,quboflag)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This call is necessary to specify the server the web services will call.
% server name as is.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% These are some definitions of variables used.
% M=number of dimensions the data lives in.
% S=total number of training set data points (S is even)
% sdb=standard deviation of blue points = positive examples
% sdr=standard deviation of red points = negative examples
% mv=distance between centers of clusters
% bluepositives and rednegatives are S/2 x M matrices storing the
% training set data
% Lambda = regularization parameter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

M=2; S=100; sdb=1.; mv=0; Lambda=0;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% exactsolve=1 means exhaustive search; exactsolve=0 means use web
% services; this is set up to do enumeration on small problems and let the
% web services take over for bigger problems.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

exactsolve=1;
if (M>6) exactsolve=0; end; % if the problem is too big to solve exactly, hand it to the web services

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% This loads the synthetic data for both positive training set data (the
% blue points) and negative training set data (the red points).
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

for k=1:M mvb(k)=0; mvr(k)=0; end;     % set mean values for each cloud in all M dimensions
for k=1:M varb(k)=sdb^2; varr(k)=sdr^2; end; % set variances for each cloud in all M dimensions

dblue=diagdens('m',mvb,'var',varb);
dred=diagdens('m',mvr,'var',varr);

bluepositives = rand(dblue,S/2);   % These return S/2 by M matrices.
rednegatives  = rand(dred,S/2);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Compute Bayes error (theoretical bound on best possible error rate with
% our training set). In order to use this you first have to plot the
% histograms to find where the crossing is for the parameters you intend to
% use. You can ignore this.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

bayesflag=0;

if (bayesflag==1)
for s = 1:S/2
distances_positive(s) = sqrt(sum(bluepositives(s,:) .^ 2));
distances_negative(s) = sqrt(sum(rednegatives(s,:) .^ 2));
end;
distances_positive = sort(distances_positive);
distances_negative = sort(distances_negative);
crossing = 5;
errors1 = sum((distances_negative > crossing));
errors2 = sum((distances_positive <= crossing));
bayes_error = (errors1 + errors2)/S;
disp(bayes_error);
x = 0:0.1:10;
figure;
hold on;
h1 = plot(x,hist(distances_positive,x));
set(h1, 'color', 'blue');
hold on;
h2 = plot(x,hist(distances_negative,x));
set(h2, 'color', 'red');
legend('positive','negative');
end;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% This loads the list of possible thetas.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

for d=1:M
datalist(:,d)=cat(1,bluepositives(:,d),rednegatives(:,d));
sorteddatalist(:,d)=sort(datalist(:,d));
for k=1:S-1 possiblethetavalues(k,d)=(sorteddatalist(k+1,d)+sorteddatalist(k,d))/2; end;
end;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% This calculates the weighted error for each possible theta.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

bluemisclasscost=ones(S/2,1);
redmisclasscost=ones(S/2,1);

for d=1:M
for k=1:S-1     % look at each possible theta value
errplus(k)=0; errminus(k)=0;
for p=1:S/2 % look at each data point to see if it’s classified correctly

%   Start with blue points. For these we get a classification error if the
%   weak classifier guesses negative.

hp=sign(bluepositives(p,d)-possiblethetavalues(k,d));
hm=sign(-bluepositives(p,d)-possiblethetavalues(k,d));
if (hp<0) errplus(k)=errplus(k)+bluemisclasscost(p); end;
if (hm<0) errminus(k)=errminus(k)+bluemisclasscost(p); end;

%   Same with red points. For these we get a classification error if the
%   weak classifier guesses positive.

hp=sign(rednegatives(p,d)-possiblethetavalues(k,d));
hm=sign(-rednegatives(p,d)-possiblethetavalues(k,d));
if (hp>0) errplus(k)=errplus(k)+redmisclasscost(p); end;
if (hm>0) errminus(k)=errminus(k)+redmisclasscost(p); end;
end;
end;
[B,IX]=sort(errplus); thetaplus(d)=possiblethetavalues(IX(1),d);
[B,IX]=sort(errminus); thetaminus(d)=possiblethetavalues(IX(1),d);
end;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Now we calculate the diagonal terms in the qubo.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Q=sparse(2*M,2*M);

for j=1:M
tot=0;
for k=1:S/2
tot=tot+sign(bluepositives(k,j)-thetaplus(j))-sign(rednegatives(k,j)-thetaplus(j));
end;
Q(j,j)=Lambda+1/(2*M)^2-tot/(M*S);
end;
for j=M+1:2*M
tot=0;
for k=1:S/2
tot=tot+sign(-bluepositives(k,j-M)-thetaminus(j-M))-sign(-rednegatives(k,j-M)-thetaminus(j-M));
end;
Q(j,j)=Lambda+1/(2*M)^2-tot/(M*S);
end;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Now we calculate the offdiagonal terms in the qubo.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

for j=2:M
for i=1:j-1
tot=0;
for k=1:S/2
tot=tot+sign(bluepositives(k,i)-thetaplus(i))*sign(bluepositives(k,j)-thetaplus(j))+sign(rednegatives(k,i)-thetaplus(i))*sign(rednegatives(k,j)-thetaplus(j));
end;
Q(i,j)=2*tot/(2*M)^2/S;
end;
end;

for j=M+1:2*M
for i=1:M
tot=0;
for k=1:S/2
tot=tot+sign(bluepositives(k,i)-thetaplus(i))*sign(-bluepositives(k,j-M)-thetaminus(j-M))+sign(rednegatives(k,i)-thetaplus(i))*sign(-rednegatives(k,j-M)-thetaminus(j-M));
end;
Q(i,j)=2*tot/(2*M)^2/S;
end;
end;

for j=M+2:2*M
for i=M+1:j-1
tot=0;
for k=1:S/2
tot=tot+sign(-bluepositives(k,i-M)-thetaminus(i-M))*sign(-bluepositives(k,j-M)-thetaminus(j-M))+sign(-rednegatives(k,i-M)-thetaminus(i-M))*sign(-rednegatives(k,j-M)-thetaminus(j-M));
end;
Q(i,j)=2*tot/(2*M)^2/S;
end;
end;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Now we convert the qubo into ising format.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

for j=1:2*M
tot=0;
for k=1:j
tot=tot+Q(k,j);
end;
for k=j:2*M
tot=tot+Q(j,k);
end;
b(j)=tot/4;
end;

J=sparse(2*M, 2*M);

for j=2:2*M
for i=1:j-1
J(i,j)=Q(i,j)/4;
end;
end;

g(1)=max(abs(b)); g(2)=max(max(abs(J))); big=max(g);
b=b/big*(2^(BOP-1)-1); J=J/big*(2^(BOP-1)-1);
b=round(b);J=round(J);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Now we solve the optimization problem.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

if (exactsolve==1) % solve using exhaustive search
if (quboflag==1) % solve exact qubo
TEmin=1000;
for k=1:2^(2*M)
TE(k)=0;
for j=1:2*M
s(j)=bitget(k-1,j);
end;

for j=1:2*M
TE(k)=TE(k)+s(j)*Q(j,j);
for i=1:j-1
TE(k)=TE(k)+s(i)*s(j)*Q(i,j);
end;
end;
if (TE(k)<TEmin) TEmin=TE(k); svec=s; end;
end;
else % solve truncated ISING
TEmin=1000;
for k=1:2^(2*M)
TE(k)=0;
for j=1:2*M
s(j)=2*bitget(k-1,j)-1;
end;

for j=1:2*M
TE(k)=TE(k)+s(j)*b(j);
for i=1:j-1
TE(k)=TE(k)+s(i)*s(j)*J(i,j);
end;
end;
if (TE(k)<TEmin) TEmin=TE(k); svec=s; end;
end;
svec=(svec+1)/2; % convert to 0/1
end;
else % solve using web services
if (quboflag==1) % solve exact qubo
Qmod=round(Q*100000);
[svec f] = orionSubmitProblemBQP(Qmod, orionServer);
else % solve truncated ISING
[svec f] = orionSubmitProblemIsing(b, J, orionServer);
svec=(svec+1)/2; % convert to 0/1
end;
end;
classsize=sum(round(svec));

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Now compute threshold T
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

tot=0;
for k=1:S/2
for j=1:M
tot=tot+svec(j)*(sign(bluepositives(k,j)-thetaplus(j))+sign(rednegatives(k,j)-thetaplus(j)))+svec(j+M)*(sign(-bluepositives(k,j)-thetaminus(j))+sign(-rednegatives(k,j)-thetaminus(j)));
end;
end;
T=tot/S;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Now compute how well the classifier works.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

testpositives = rand(dblue,S/2);
testnegatives  = rand(dred, S/2);

for k=1:S/2
tot=0;
for j=1:M
tot=tot+svec(j)*sign(testpositives(k,j)-thetaplus(j));
end;
for j=M+1:2*M
tot=tot+svec(j)*sign(-testpositives(k,j-M)-thetaminus(j-M));
end;
if (abs(tot-T)>0.001)
prediction(k)=sign(tot-T);
else
prediction(k)=-1;
end;
end;

finalerrorp=0;
for k=1:S/2
if (prediction(k)<0)
finalerrorp=finalerrorp+1;
end;
end;

for k=1:S/2
tot=0;
for j=1:M
tot=tot+svec(j)*sign(testnegatives(k,j)-thetaplus(j));
end;
for j=M+1:2*M
tot=tot+svec(j)*sign(-testnegatives(k,j-M)-thetaminus(j-M));
end;
if (abs(tot-T)>0.001)
prediction(k)=sign(tot-T);
else
prediction(k)=-1; % note: I’m making them guess 50% right if svec=0.
end;
end;

finalerrorn=0;
for k=1:S/2
if (prediction(k)>0)
finalerrorn=finalerrorn+1;
end;
end;

accfin=1-(finalerrorp+finalerrorn)/S;

This entry was posted in D-Wave Science & Technology by Geordie.