Sparse coding is a very interesting idea that we’ve been experimenting with. It is a way to find ‘maximally repeating patterns’ in data, and use these as a basis for representing that data.

Some of what’s going to follow here is quite technical. However these are beautiful ideas. They are probably related to how human perception and cognition functions. I think sparse coding is much more interesting and important than, say, the Higgs Boson.

**Everything starts from data**

Sparse coding requires data. You can think of ‘data’ as a (usually large) set of objects, where the objects each can be represented by a list of real numbers. As as example, we’ll use the somewhat pathological MNIST handwritten digits data set. But you can use any dataset you can imagine. Each data object in this set is a small (28×28 pixel) greyscale image of a handwritten digit. A 28×28 pixel greyscale image can be represented by 28×28 = 784 numbers, each in the range 0..255. The training set has 60,000 of these.

We can represent the entire data set using a two-dimensional array, where there are 60,000 columns (one per image), and 784 rows (one for each pixel). Let’s call this array .

**Technical detail: this thing is bigger than it has to be**

One thing you may notice about MNIST is that each image kind of looks mostly similar. In fact we can exploit this to get a quick compression of the data. The trick we will use is called Singular Value Decomposition (SVD). SVD quickly finds a representation of the data that allows you to reduce its dimensionality. In the case of MNIST, it turns out that instead of using 784 pixel values, we can get away with using around 30 SVD modes instead, with only a small degradation in image quality. If we perform an SVD on and keep only the parts corresponding to the largest 30 singular values, we get a new matrix which still has 60,000 columns, but only 30 rows. Let’s call the column vector , which stores the compressed image. This trick, and others related to it, can always be used to pre-process any raw data we have.

**The Dictionary**

Let’s now create a small number — let’s say 32 — of very special images. The exact number of these actually matters quite a bit (it’s an example of something called a hyperparameter, many of which lurk in machine learning algorithms and are generally a giant pain in the ass), but the most important thing is that it should be larger than the dimension of the data (which in this case is 30). When this is the case, it’s called an overcomplete basis, which is good for reasons you can read about here.

These images will be in the exact same format as the data we’re learning over. We’ll put these in a new two dimensional array, which will have 32 columns (one for each image) and 30 rows (the same as the post-SVD compressed images above). We’ll call this array the dictionary. Its columns are dictionary atoms. The dictionary atom is the column vector . These dictionary atoms will store the ‘maximally repeating patterns’ in our data that are what we’re looking for.

At this stage, we don’t know what they are — we need to learn them from the data.

**The sparse coding problem**

Now that we have a data array, and a placeholder array for our dictionary, we are ready to state the sparse coding problem.

Now to see how all this works, imagine we want to reconstruct an image (say ) with our dictionary atoms. Imagine we can only either include or exclude each atom, and the reconstruction is a linear combination of the atoms. Furthermore, we want the reconstruction to be sparse, which means we only want to turn on a small number of our atoms. We can formalize this by asking for the solution of an optimization problem.

**L0-norm sparse coding**

Define a vector of binary (0/1) numbers of length 32. Now solve this problem:

Find and (remember the vectors are column vectors in the matrix ) that minimize

The real number is called a regularization parameter (another one of those hyperparameters). The larger this number is, the bigger the penalty for adding more dictionary atoms to the reconstruction — the rightmost term counts the number of atoms used in the reconstruction. The first term is a measure of the reconstruction error. Minimizing this means minimizing the distance between the data (sometimes called the ground truth) and the reconstruction of the image .

[Note that this prescription for sparse coding is different that the one typically used, where the weights are real valued and the regularization term is of the L1 (sum over absolute values of the weights) form.]

You may see a simple way to globally optimize this. All you have to do is set , for all the other k, and for all the other k — in other words, store the image in one of the dictionary atoms and only turn that one on to reconstruct. Then the reconstruction is perfect, and you only used one dictionary atom to do it!

OK well that’s useless. But now say we sum over all the images in our data set (in the case of MNIST this is 60,000 images). Now instead of the number of images being less than the number of dictionary atoms, it’s much, much larger. The trick of just ‘memorizing’ the images in the dictionary won’t work any more.

**Now over all the data
**

Let’s call the total number of data objects . In our MNIST case . We now need to find (this is a 32xS array) and that minimize

This is the full sparse coding prescription — in the next post I’ll describe how to solve it, what the results look like, and introduce a bunch of ways to make good use of the dictionary we’ve just learned!

Looking forward to that next post!