Mining market graphs September 9, 2008
Posted by Geordie in For Developers.trackback
One way to represent financial instruments, and the relations among them, is as follows. Represent each financial instrument j as a vertex V_j in a graph, and represent relationships between instruments i and j as edges E_ij connecting these vertices.
Facts about the entities themselves may then reside in labels associated with the vertices, and facts about pairs of instruments may be applied as labels on the edges. After generating such a labeled graph, queries of this information may naturally be recast as graph theory problems such as Maximum Independent Set.
Here is a simple example. Restrict the financial instruments to be instruments included in the Dow Jones Wilshire 5000 Composite Index. There are as of today more than 5,000 of these. This set is considered to be the best measure of the entire US stock market. No other index comes close to offering its comprehensiveness. Furthermore associate to each vertex no labels other than the 1-1 mapping between vertex # and ticker symbol–for example, V_456 = vertex # 456 might be associated with MSFT.
Draw an edge E_ij between vertices i and j if the absolute value of a number Q_ij is larger than a threshold K. Q_ij is generated via the following prescription
where
is the return on stock i on day t (P_i(t) is the price of stock i on day t). The larger K is the more edges will be drawn.
We now have a graph with ~ 5,000 vertices and up to 5000^2/2 ~12.5 million (but likely in practice many fewer) edges.
Once this graph is constructed, for example by pulling required financial information off of a public repository such as the Google Finance site, we can ask questions of the data. One of these is as follows: Find the Maximum Independent Set (MIS) of the graph. Because our edge drawing condition is related to correlations between stock prices (Q_ij close to +1 means high correlation between stocks i and j, Q_ij close to -1 means high anti-correlation, Q_ij close to zero means small correlation), the MIS provides a maximally diversified set of instruments. All of the instruments in the MIS have small correlations to all other instruments. No two instruments in the set are correlated more that the user-defined threshold K.
In order to solve this type of problem using our web services APIs, first download and install the Orion client software, write an application in your language of choice that generates the market graph as above by for example polling and processing data from Google Financial, and once the market graph has been created submit the problem to the web services server as described here and in the client software.
This basic example has been discussed a lot in the literature, see for example here and references therein. Many obvious extensions can be made by extending edge and vertex labeling conventions and solving generalized versions of MIS, such as maximum weighted independent set, or quadratic unconstrained binary optimization. In addition, the numbers of instruments treated can potentially be vastly expanded to include, for example, all traded financial instruments, including derivatives, forex, etc.
Here is a worked example to show how to enable the basic example described above. We select a group of ten stocks to work with: the NASDAQ listings of Google (GOOG), Microsoft (MSFT), Yahoo (YHOO), Time Warner (TWX), Affymetrix (AFFX), General Electric (GE), QLT (QLTI), ZymoGenetics (ZGEN), Apple (AAPL) and Baidu (BIDU).
For each of these stocks we do the following. Go to Google Finance and type in a ticker symbol (for example, GOOG) in the Get Quotes option. Click on the Historical Prices link. This will bring up a table of pricing information. Click on Download to Spreadsheet. This loads the data you see into an Excel spreadsheet.
For this example we will use the Closing Price data for each stock to represent the price variables P_i(t), and we will use a total of 10 days of closing prices. If we identify vertex 1 in our market graph with Google, then the variables P_1(t) are the closing values of the stock for the last ten days of data (the rightmost column in the following table)
| Date | Close | |
| P_1(10) | 9-Sep-08 | 418.66 |
| P_1(9) | 8-Sep-08 | 419.95 |
| P_1(8) | 5-Sep-08 | 444.25 |
| P_1(7) | 4-Sep-08 | 450.26 |
| P_1(6) | 3-Sep-08 | 464.41 |
| P_1(5) | 2-Sep-08 | 465.25 |
| P_1(4) | 29-Aug-08 | 463.29 |
| P_1(3) | 28-Aug-08 | 473.78 |
| P_1(2) | 27-Aug-08 | 468.58 |
| P_1(1) | 26-Aug-08 | 474.16 |
Similarly with the other stocks. We choose to label them in the order presented above (Google = 1, Microsoft =2, etc.).
Now that we have all of the P_i(t) information loaded, we can calculate the R_i(t) quantities using the equation given above. For the GOOG data shown here, this looks like
| R_1(10) | -0.00308 |
| R_1(9) | -0.05625 |
| R_1(8) | -0.01344 |
| R_1(7) | -0.03094 |
| R_1(6) | -0.00181 |
| R_1(5) | 0.004222 |
| R_1(4) | -0.02239 |
| R_1(3) | 0.011036 |
| R_1(2) | -0.01184 |
Now that we have R_i(t) we can calculate the Q_ij quantities. For example, the quantity <R_1(t)> is the time average of the data in the above table–in other words, to calculate <R_1(t)>, sum all R_1(t) and divide by the number of times considered. This gives <R_1(t)>=-0.01383 for this data. To calculate <R_i^2-<R_i>^2>, we calculate the time average of the squares of the R_1(t) minus <R_1(t)>^2. For this data this works out to <R_1^2-<R_1>^2>=0.000374914. The last average we need to calculate to find Q_ij is the <R_i(t) R_j(t)> average, which is the only term that depends on two stocks. Let’s select MSFT as the other stock. The price information is
| Date | Close | |
| P_2(10) | 9-Sep-08 | 26.1 |
| P_2(9) | 8-Sep-08 | 26.12 |
| P_2(8) | 5-Sep-08 | 25.65 |
| P_2(7) | 4-Sep-08 | 26.35 |
| P_2(6) | 3-Sep-08 | 26.9 |
| P_2(5) | 2-Sep-08 | 27.1 |
| P_2(4) | 29-Aug-08 | 27.29 |
| P_2(3) | 28-Aug-08 | 27.94 |
| P_2(2) | 27-Aug-08 | 27.56 |
| P_2(1) | 26-Aug-08 | 27.27 |
The R_2(t) numbers are
| R_2(10) | -0.00077 |
| R_2(9) | 0.018158 |
| R_2(8) | -0.02692 |
| R_2(7) | -0.02066 |
| R_2(6) | -0.00741 |
| R_2(5) | -0.00699 |
| R_2(4) | -0.02354 |
| R_2(3) | 0.013694 |
| R_2(2) | 0.010578 |
The cross terms R_1(t) R_2(t) are now just row by row multiplication of the R_1 and R_2 numbers, like this:
| R_2(10)*R_1(10) | 2.35659E-06 |
| R_2(9)*R_1(9) | -0.001021405 |
| R_2(8)*R_1(8) | 0.000361807 |
| R_2(7)*R_1(7) | 0.000639212 |
| R_2(6)*R_1(6) | 1.33861E-05 |
| R_2(5)*R_1(5) | -2.94953E-05 |
| R_2(4)*R_1(4) | 0.000527036 |
| R_2(3)*R_1(3) | 0.000151129 |
| R_2(2)*R_1(2) | -0.000125225 |
The time average is <R_1(t) R_2(t)>=5.76446E-05.
Given N vertices, there are N(N-1)/2 of these pairwise values to calculate. In our example, N=10 giving 45 possible pairings of stocks. Once these are calculated, the Q_ij values can be calculated. In our example, Q_12=-0.03202. Listed in dictionary order, here are all the Q_ij for this example:
| -0.03202 |
| 0.008955 |
| -0.02789 |
| -0.07682 |
| -0.14103 |
| 0.511109 |
| 0.329022 |
| 0.20468 |
| -0.1307 |
| 0.375125 |
| 0.529226 |
| 0.448831 |
| 0.581491 |
| -0.3952 |
| 0.158967 |
| 0.312507 |
| 0.388969 |
| 0.569998 |
| -0.02922 |
| 0.648822 |
| 0.471136 |
| 0.027946 |
| 0.837878 |
| 0.801457 |
| 0.68869 |
| 0.783283 |
| 0.122789 |
| 0.141996 |
| 0.399221 |
| 0.84458 |
| 0.465785 |
| -0.23881 |
| -0.09611 |
| -0.30836 |
| 0.267851 |
| 0.219876 |
| 0.040118 |
| 0.573154 |
| 0.705367 |
| 0.092723 |
| 0.52493 |
| 0.274931 |
| 0.414758 |
| -0.02488 |
| 0.629357 |
Our edge drawing condition is to draw an edge E_ij between vertices i and j if the absolute value of Q_ij is greater than some user-defined threshold K. Remember that Q_ij is a measure of the correlation of stocks i and j, so drawing an edge with this condition means that i and j are correlated or anti-correlated over the threshold K. The lower K is, the more edges will be drawn and the more stringent the test for low correlation.
To see how this works, choose K=0.5. This generates a total of 14 edges, as there are 14 |Q_ij| values greater than 0.5. In DIMACS format this graph is represented as
p edge 10 14
e 1 7
e 2 4
e 2 6
e 3 4
e 3 6
e 3 9
e 3 10
e 4 5
e 4 6
e 4 10
e 6 9
e 6 10
e 7 9
e 9 10
Feeding this problem to Orion web services’ Maximum Independent Set solver gives output 1110100100 5, which tells us that vertices 1, 2, 3, 5 and 8 are in the (size 5) Maximum Independent Set. Here is the market graph with the Maximum Independent Set colored dark blue:
So in this case a diversified portfolio would be Google, ZymoGenetics, Yahoo, Affymetrix and Microsoft. If we wanted to increase the stringency of the diversification test, we could lower K to say 0.1, giving a much more highly connected graph and smaller Maximum Independent Set.

Oh great, thanks for ruining my Tuesday night. I was planning on watching a movie but now I’m going to be looking at your APIs.
Dave: Unless the movie was Rocky III you made the right choice… oh yeah or Aliens. Strange Brew maybe.
Maybe this is useful?
http://faq.wordpress.com/2007/02/18/can-i-put-math-or-equations-in-my-posts/
So your Q_ij is just a function of, say, today’s stock price’s and yesterday’s prices. What would you suggest if you wanted to take into account a month of activity? Do many max cliques, one for each day of the month, or base the Q_ij and R_ij on today’s prices and the prices from a month ago.
Sweet Dave thanks. I think if you wanted to track over time you’d do the former, ie compute the maximally diversified portfolio every day, or as short a period as you can. I think you’ll find certain stocks are correlated over different timescales. Some will remain correlated regardless of the time difference (seconds vs. months), some will be correlated only over short times, some might be correlated only over long times.
[...] from Geordie Rose the founder of D-Wave. Here are recent posts by him on practical applications: 1 2 3 [...]
I seem to recall that on average there was almost no correlation between stocks separated by a day, but yeah, there might be pockets of correlation or correlation on different timescales. If you could find such correlations that are consistent enough and greater than inflation, you could make some money in the long-term by repeatedly predicting the future better than random (until everyone else starts doing the same thing).
A good movie to go with this post is π, mmmmm ants. And drills…mmm, trepanation.
Oh, and badly played go, horrible.
Geordie, I know what Orion should be good for, and that is kicking ass at go. It’s play tree is way way too large for a brute force method like the chess computers, and it’s “aesthetic” pattern recognition is something that humans seem optimized for, as opposed to a classical computer. Currently the best software on a classical computer can get it’s ass kicked by amateurs giving it a ridiculous handicap (something like 30 stones…). I, on the other hand would still probably get my ass handed to me
“Naysay now, naysayers!”