# Monthly Archives: August 2010

## Evaluating High Dimensional Expectation

Here I am going to outline a simple approach to evaluating a high-dimensional discrete expectation on empirical data.   For example, consider evaluating an expectation of the form:

E[f(X)] = Σ(x) f(x)·p(x)

where the sum runs over all possible state vectors x.

Let’s say X is of dimension 10 and we have a 10000 x 10 matrix of observations.   Furthermore each of the 10 variables can take 5 possible states:

X0 = { 1,4,4,1,1,1,1,3,3,2 }
X1 = { 1,2,2,2,1,1,1,5,5,2 }
….

The total number of possible distinct state vectors is 5^10, or over 9 million.   A naive approach to the above sum is to consider all possible sequences of X, but this would be enormously expensive and would not scale as the dimension or the number of individual states grows.

The distribution is going to be sparse.   We know this because 5^10 >> 10,000 observations.   The worst-case algorithm would scan the full sample once per possible pattern to calculate frequencies.   This poor approach has complexity O(N S^d), where N is the number of samples, S is the number of distinct symbols, and d is the dimension.

Some simple observations:

1. It makes sense to restrict the domain of the sum to sequences that actually occur in the sample (getting rid of the S^d factor)
2. It makes sense to avoid multiple passes, where naively there would be a pass for each frequency calculation

Since we are dealing with a finite sample and known symbols, there is an approach that is O(N log N).   We create a binary or n-ary tree ordered lexically by symbol sequence.   Iterating through the samples, for each sample we locate or insert the corresponding node in the tree and increment its frequency count.

The tree is rebalanced as necessary to keep the average path length at log N.   In one pass we now know both the domain of the distribution and the frequencies.   The expectation is now straightforward to evaluate.

Note that instead of using a tree, one can use a hash table keyed on the symbol sequences.   This results in an O(N) algorithm.   For some applications, though, it is useful to have a tree that places neighbors close to each other (for instance, if we want to run a nearest-neighbor analysis).
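As a minimal sketch of the hash-table variant, here is the one-pass O(N) counting scheme on synthetic data (the payoff function f and the random sample are illustrative assumptions):

```python
from collections import Counter

import numpy as np

# Hypothetical data: 10,000 samples of a 10-dimensional discrete variable,
# each component taking one of 5 states.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(10_000, 10))

def f(x):
    # Placeholder payoff function, purely for illustration.
    return sum(x)

# One pass over the sample: count the frequency of each observed state vector.
counts = Counter(map(tuple, X))
N = X.shape[0]

# The expectation only needs the observed (sparse) support, never the full 5^10 domain.
expectation = sum(freq / N * f(x) for x, freq in counts.items())
print(expectation)
```

The hash map plays the role of the tree here; swapping in a sorted container would recover the lexically ordered structure at O(N log N).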

Filed under strategies

## Multivariate Independence Tests

Thought I should mention the obvious multivariate independence tests.   These can be used in a 1st pass to trim out the obviously independent variables.   Assume a d-dimensional joint distribution p(x1, x2, …, xd) as a starting point, with sample vector X = { x1, x2, …, xd }.

Let us consider subdividing the above d-dimensional distribution into n- and m-dimensional distributions respectively (where d = n + m).   We want to test whether the marginal distribution of the n variables is independent of the marginal distribution of the m variables.   Our sample vector for the full joint distribution can be expressed as:

X = { {a1, a2, …, an}, {b1, b2, …, bm} }

and we have marginal distributions p(A) and p(B), each of which covers a subset of the dimensions of p(X).   The covariance matrix for p(X) looks like:

Σ = | Σaa   Σab |
    | Σba   Σbb |

Intuitively, if A and B are independent then Σab should be 0 (well, not significantly different from 0).   We set up the null hypothesis H0 to indicate independence.   The Wilks' test rejects independence if:

−N · log( |Σ| / (|Σaa| · |Σbb|) )  >  χ²(n·m, 1 − α)

where N is the number of samples.   This test is valid even for non-elliptical distributions, so it is a good 1st-line test.   The intuition behind the above measure is that the determinant ratio compares the volume of the full covariance matrix to the volumes of its non-crossing (block-diagonal) components.   If the volume is mostly explained by the non-crossing components, then A and B are independent.
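A minimal sketch of the test statistic, assuming independent normal data and a hard-coded 5% chi-squared cutoff (both are illustrative choices, not part of the original post):

```python
import numpy as np

# Hypothetical sample: N observations, split into an n-block and an m-block,
# independent by construction.
rng = np.random.default_rng(1)
N, n, m = 5_000, 2, 3
data = rng.standard_normal((N, n + m))

S = np.cov(data, rowvar=False)
Saa = S[:n, :n]
Sbb = S[n:, n:]

# Wilks' lambda: volume of the full covariance relative to the
# non-crossing (block-diagonal) volumes.
lam = np.linalg.det(S) / (np.linalg.det(Saa) * np.linalg.det(Sbb))
stat = -N * np.log(lam)  # ~ chi-squared with n*m dof under H0

# 95% quantile of chi-squared with 6 dof is about 12.59 (standard tables).
reject_independence = stat > 12.59
print(stat, reject_independence)
```

For truly independent blocks the statistic hovers around its n·m degrees of freedom; a large value signals that the cross-covariance Σab carries real volume.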

Filed under strategies

## Finding Effective Sub-distributions

I didn’t get around to talking about dimensionality reduction in the previous post.   I have a number of related problems I am interested in solving.   Setting up the problem, let us say that we have a distribution p(E | C), where both E and C are sets of the same dimension (d):

E = {e1, e2, e3, … e[d]}
C = {c1, c2, c3, … c[d]}

d is a high dimension >> 1000.

Problem 1: Sub-distributions with C and E of same dimension
In the case where many of the variables are independent, the joint probability of a specific 1000-variable state will often be much smaller than the probability of a lower-dimensional state (of, say, 20 variables with covariance).

The question is then: which subsets of dimensions of this distribution are the most predictive?   Another way to state this is: which subsets of C provide the most information transfer into E?

It would be then of interest to look at the top N subsets of the distribution that carry the most information from C → E.

Problem 2: Sub-distributions with C and E of different dimension
As in Problem 1, we look for subsets of C that carry the most predictive information for subsets of E.   In this case, though, C and E need not be of the same dimension.   Combinatorially this is a harder problem than Problem 1; however, the math is the same.

Outline of Approach

1. A spanning graph structure is determined between variables, to avoid evaluating an astronomical number of combinations
2. A transfer entropy analysis is done on the edges in the spanning structure
3. Subgraphs of depth 1, 2, 3, … are evaluated from a transfer entropy point of view to determine the power of each subgraph
4. Subgraphs are sorted by transfer entropy
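Steps 1, 2 and 4 can be sketched for depth-1 subgraphs (single directed edges); the lag-1 plug-in estimator and the random symbolized data are illustrative assumptions, not the post's actual pipeline:

```python
from collections import Counter
from itertools import permutations
from math import log2

import numpy as np

def transfer_entropy(x, y):
    """Lag-1 plug-in transfer entropy T(Y -> X) for symbol sequences."""
    n = len(x) - 1
    joint = Counter(zip(x[1:], x[:-1], y[:-1]))   # (x[t+1], x[t], y[t])
    xy = Counter(zip(x[:-1], y[:-1]))             # (x[t], y[t])
    xx = Counter(zip(x[1:], x[:-1]))              # (x[t+1], x[t])
    xm = Counter(x[:-1])                          # x[t]
    return sum((c / n) * log2((c / xy[(x0, y0)]) / (xx[(x1, x0)] / xm[x0]))
               for (x1, x0, y0), c in joint.items())

# Hypothetical symbolized returns: rows are time steps, columns are variables.
rng = np.random.default_rng(2)
data = rng.integers(0, 3, size=(2_000, 4))

# Score each directed edge j -> i by transfer entropy; the strongest edges
# form the candidate spanning structure, sorted as in step 4.
edges = sorted(((transfer_entropy(data[:, i], data[:, j]), j, i)
                for i, j in permutations(range(data.shape[1]), 2)),
               reverse=True)
print(edges[:3])
```

Deeper subgraphs (step 3) would condition on more than one source variable, which grows the state space quickly and is where the spanning restriction earns its keep.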

Alternatives
There are undoubtedly many approaches to finding factors, particularly with classification problems where E is of small dimension (often 1 dimension).   I’d be grateful for alternative suggestions.

Filed under strategies

## Transfer Entropy

I am revisiting spanning trees & clusters that express relationships amongst assets.   I am also interested in a related problem: reducing dimensionality in a high-dimensional distribution of asset returns.

Linear Approaches
The most naive approach is to look at the correlations amongst time-series.   Correlations have a number of well-known problems for this purpose:

1. correlation ≠ causality
2. correlations can occur between completely unrelated variables for arbitrary sample periods
3. correlations are a linear measure of similarity

The VECM model offers a significant improvement over the correlation approach, at least in terms of identifying causality.   For those unfamiliar with VECM, it is similar to an ARMA model, but extended to include lags from the other timeseries.   For 2 variables, a lag-p VECM would be set up as:

Δx[t] = α + β1·Δx[t−1] + … + βp·Δx[t−p] + γ1·Δy[t−1] + … + γp·Δy[t−p] + ε[t]

That is, Δx is described in terms of lagged Δx’s and lagged Δy’s.   Fitting this (usually in matrix form), one arrives at coefficients (assuming statistical significance) which on average describe the interactions between the series X and Y.

Taking this a step further, one can do the “Granger causality test”: an F-test to determine whether Δx with cross-lags produces significantly less error variance than Δx without cross-lags.   This is performed for various lags to determine the minimum number of lags for which there is “causality” (or none at all).

This is not a bad approach for normally distributed returns, but is flawed for data with non-linearities.
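A minimal sketch of the lag-1 Granger-style F-test using plain least squares; the simulated series (where Δx really does depend on lagged Δy) and the degrees-of-freedom bookkeeping are illustrative assumptions:

```python
import numpy as np

# Simulate two differenced series; dx depends on lagged dy by construction.
rng = np.random.default_rng(3)
T = 500
dy = rng.standard_normal(T)
dx = np.empty(T)
dx[0] = rng.standard_normal()
for t in range(1, T):
    dx[t] = 0.3 * dx[t - 1] + 0.5 * dy[t - 1] + rng.standard_normal()

Y = dx[1:]
ones = np.ones(T - 1)

def rss(design):
    # Residual sum of squares from an OLS fit of Y on the design matrix.
    beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
    resid = Y - design @ beta
    return resid @ resid

rss_restricted = rss(np.column_stack([ones, dx[:-1]]))         # own lags only
rss_full = rss(np.column_stack([ones, dx[:-1], dy[:-1]]))      # plus cross-lag

# F-test: does adding the lagged-y term significantly reduce error variance?
q, k = 1, 3  # restrictions; parameters in the full model
F = ((rss_restricted - rss_full) / q) / (rss_full / (T - 1 - k))
print(F)
```

A large F rejects the null that the cross-lags add nothing, i.e. Δy “Granger-causes” Δx at this lag.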

Information Theory
It turns out information theory provides a powerful tool in analyzing causality (or at least temporal flows of information from one market to another).

Shannon measured the information for a particular event “e” as:

I(e) = −log₂ p(e)

Let us associate a symbol with each possible distinct event in a system {A, B, C, … }.  A sampling of these events across time will lead to a sequence of symbols (for example:  ABAAABBBBABABAA).   If the symbol B occurs with p(B) = 1 (i.e. BBBBBBBB), the sequence can carry no information as it can only represent 1 state.   Note that I(B) would be 0 in this case.

Shannon went on to define the entropy of the system as the expected information content:

H(X) = −Σ(x) p(x)·log₂ p(x)

This can be extended to look at the joint entropy of 2 or more “symbol” generators, such as:

H(X, Y) = −Σ(x,y) p(x,y)·log₂ p(x,y)

Observing that if X and Y are independent, p(x,y) = p(x)·p(y), we can determine how much information has been introduced into the joint event space versus the amount there would be were the two sequences independent:

I(X; Y) = Σ(x,y) p(x,y)·log₂[ p(x,y) / (p(x)·p(y)) ]

The above is called the “Mutual Information” measure.   This measure does not differentiate between information X provides about Y and information Y provides about X.   In the context of finance it is useful to know more about the directional flow of information than simply that two series share information.
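A tiny plug-in estimate of mutual information between two symbol sequences, using the equivalent identity I(X;Y) = H(X) + H(Y) − H(X,Y); the sequences themselves are illustrative (the second is a copy of the first, so the MI should equal the entropy):

```python
from collections import Counter
from math import log2

# Toy symbol sequences; y is identical to x, so I(X;Y) should equal H(X).
x = "ABAAABBBBABABAA"
y = "ABAAABBBBABABAA"

def entropy(counts, n):
    # Plug-in Shannon entropy in bits from a frequency table.
    return -sum(c / n * log2(c / n) for c in counts.values())

n = len(x)
h_x = entropy(Counter(x), n)
h_y = entropy(Counter(y), n)
h_xy = entropy(Counter(zip(x, y)), n)

mi = h_x + h_y - h_xy  # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(h_x, mi)
```

For independent sequences H(X,Y) would approach H(X) + H(Y), driving the estimate toward 0.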

Transfer Entropy
Transfer Entropy is a more precise measure than Mutual Information in that it captures both the direction and the temporal relationship of information flow.   The Transfer Entropy approach is a nearly 1:1 analog of Granger causality, except that it is applicable to a wider range of systems (as it turns out, Granger causality and transfer entropy have been shown to be equivalent for data with normally distributed noise).

Like Granger Causality (GC), we look at the entropy (or in the case of GC: error variance) with and without an explanatory variable from the other series.   For a single lag, this results in the following measure:

T(Y→X) = Σ p(x[t+1], x[t], y[t]) · log₂[ p(x[t+1] | x[t], y[t]) / p(x[t+1] | x[t]) ]

The above expresses the transfer entropy of y[t] on x[t+1], i.e. how much impact y[t] has on x[t+1].   Changing the conditional probabilities to express p(y[t+1] | x[t], y[t]) would allow us to explore the other direction.   Of course this can be evaluated for more lags (the above is just for 1 lag).

Finally, one needs to consider the level of significance for a given transfer entropy, to understand at which point there is no further relationship when looking at past lags or other variables.   The approach taken is to measure the baseline entropy in a shuffled series (one that removes the temporal correlations but maintains the symbols and their marginal frequencies).

This approach is much more robust than Granger causality if the data set one is working with has non-linearities.
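The shuffle baseline can be sketched as follows; the synthetic pair (x is a one-step-delayed copy of y, so roughly 1 bit should flow from y to x) and the 3-sigma cutoff are illustrative assumptions:

```python
from collections import Counter
from math import log2

import numpy as np

def te(x, y):
    """Lag-1 plug-in transfer entropy T(Y -> X)."""
    n = len(x) - 1
    joint = Counter(zip(x[1:], x[:-1], y[:-1]))
    cond_xy = Counter(zip(x[:-1], y[:-1]))
    cond_x = Counter(zip(x[1:], x[:-1]))
    marg = Counter(x[:-1])
    return sum((c / n) * log2((c / cond_xy[(x0, y0)]) /
                              (cond_x[(x1, x0)] / marg[x0]))
               for (x1, x0, y0), c in joint.items())

# Synthetic binary sequences: x copies y with a one-step delay.
rng = np.random.default_rng(4)
y = rng.integers(0, 2, size=5_000)
x = np.empty_like(y)
x[0] = 0
x[1:] = y[:-1]

observed = te(x, y)

# Baseline: shuffling y destroys temporal structure but keeps the symbols
# and their marginal frequencies.
shuffles = [te(x, rng.permutation(y)) for _ in range(100)]
threshold = np.mean(shuffles) + 3 * np.std(shuffles)
print(observed, threshold, observed > threshold)
```

The shuffled scores measure the small-sample bias of the estimator; only transfer entropy well above that baseline should be read as a real directional relationship.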