# Transfer Entropy

I am revisiting spanning trees and clusters that express relationships among assets. I am also interested in a related problem: reducing the dimensionality of a high-dimensional distribution of asset returns.

## Linear Approaches

The most naive approach is to look at the correlations among time series. Correlations have a number of well-known problems for this purpose:

1. correlation ≠ causality
2. correlations can occur between completely unrelated variables for arbitrary sample periods
3. correlations are a linear measure of similarity

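The VECM model offers a significant improvement over the correlation approach, at least in terms of identifying causality. For those unfamiliar with VECM, it is similar to an autoregressive (AR) model, but extended to include lags from the other time series. For 2 variables, the Δx equation of a lag-p model would be set up as:

$$\Delta x_t = \alpha + \sum_{i=1}^{p} \beta_i \, \Delta x_{t-i} + \sum_{i=1}^{p} \gamma_i \, \Delta y_{t-i} + \varepsilon_t$$

That is, Δx is described in terms of lagged Δx’s and lagged Δy’s, with a symmetric equation for Δy (a full VECM also adds an error-correction term on the lagged levels of x and y). Solving for this (usually in matrix form), one arrives at coefficients which, assuming statistical significance, describe on average the interactions between the series X and Y.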

Taking this a step further, one can run the Granger causality test: an F-test that determines whether the model for Δx with cross-lags produces significantly less error variance than the model for Δx without cross-lags. This is repeated for various lags to determine the minimum number of lags for which there is “causality” (or whether there is none at all).
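The F-test described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full VECM fit: the function names `lagged` and `granger_f`, the synthetic data, and the choice of plain least squares on the differenced series are all assumptions made for the example.

```python
import numpy as np

def lagged(series, p):
    """Columns series[t-1], ..., series[t-p] for t = p .. n-1."""
    n = len(series)
    return np.column_stack([series[p - i : n - i] for i in range(1, p + 1)])

def granger_f(dx, dy, p):
    """F statistic for 'lags of dy reduce the error variance of dx' at lag order p."""
    target = dx[p:]
    ones = np.ones((len(target), 1))
    restricted = np.hstack([ones, lagged(dx, p)])          # own lags only
    unrestricted = np.hstack([restricted, lagged(dy, p)])  # own lags + cross-lags

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = len(target) - unrestricted.shape[1]
    # p restrictions (the cross-lag coefficients) are being tested
    return ((rss_r - rss_u) / p) / (rss_u / dof)

# Synthetic check (hypothetical data): dy leads dx by one step, not vice versa.
rng = np.random.default_rng(0)
n = 2000
dy = rng.normal(size=n)
dx = np.zeros(n)
for t in range(1, n):
    dx[t] = 0.5 * dy[t - 1] + rng.normal()

f_y_to_x = granger_f(dx, dy, p=2)  # should be large
f_x_to_y = granger_f(dy, dx, p=2)  # should be small
```

The statistic would then be compared against an F distribution with `p` and `dof` degrees of freedom to get a significance level.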

This is not a bad approach for normally distributed returns, but is flawed for data with non-linearities.

## Information Theory

It turns out that information theory provides a powerful tool for analyzing causality (or at least temporal flows of information from one market to another).

Shannon measured the information for a particular event “e” as:

$$I(e) = -\log_2 p(e)$$

Let us associate a symbol with each possible distinct event in a system {A, B, C, … }. A sampling of these events across time will lead to a sequence of symbols (for example: ABAAABBBBABABAA). If the symbol B occurs with p(B) = 1 (i.e. BBBBBBBB), the sequence can carry no information, as it can only represent 1 state. Note that I(B) would be 0 in this case.

Shannon went on to define the entropy of the system as the expected information content:

$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$
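As a small sketch of these two definitions, the snippet below estimates symbol probabilities from observed frequencies in a sequence. The function names are illustrative, and the plug-in (frequency-count) estimate is an assumption of the example.

```python
from collections import Counter
from math import log2

def information(p):
    """Shannon information, in bits, of an event with probability p."""
    return -log2(p)

def entropy(symbols):
    """Expected information content of a symbol sequence, in bits."""
    counts = Counter(symbols)
    n = len(symbols)
    return sum((c / n) * information(c / n) for c in counts.values())

h_constant = entropy("BBBBBBBB")        # a single state carries no information
h_example = entropy("ABAAABBBBABABAA")  # two nearly equiprobable symbols: just under 1 bit
```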

This can be extended to look at the joint entropy for 2 or more “symbol” generators, such as:

$$H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y)$$

Observing that if X and Y are independent, p(x,y) = p(x)p(y), we can determine how much information has been introduced into the joint event space versus the amount of information were the two sequences independent as:

$$I(X; Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}$$

The above is called the “Mutual Information” measure. This measure does not distinguish the information that X provides about Y from the information that Y provides about X. In the context of finance it is useful to know the direction in which information flows, not just that two series share information.
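To make the symmetry concrete, here is a minimal sketch that estimates mutual information from two aligned symbol sequences using frequency counts. The function name and the toy sequences are assumptions for illustration only.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y), in bits, from two aligned symbol sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

mi_independent = mutual_information("AABB", "ABAB")  # joint = product of marginals
mi_identical = mutual_information("ABAB", "ABAB")    # equals the entropy of the sequence
```

Swapping the arguments gives the same value, which is exactly why the measure cannot tell which series leads.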

## Transfer Entropy

Transfer Entropy is a more precise measure than Mutual Information in that it captures the direction of information flow and its temporal relationship. The Transfer Entropy approach is a nearly 1:1 analog of Granger causality, except that it is applicable to a wider range of systems. (As it turns out, Granger causality and transfer entropy have been shown to be equivalent for data with normally distributed noise.)

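Like Granger Causality (GC), we look at the entropy (or in the case of GC: error variance) with and without an explanatory variable from the other series. For a single lag, this results in the following measure:

$$T_{Y \to X} = \sum_{x_{t+1},\, x_t,\, y_t} p(x_{t+1}, x_t, y_t) \log_2 \frac{p(x_{t+1} \mid x_t, y_t)}{p(x_{t+1} \mid x_t)}$$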

The above expresses the transfer entropy of y[t] on x[t+1], i.e. how much impact y[t] has on x[t+1]. Changing the conditional probabilities to express p(y[t+1] | x[t], y[t]) would allow us to explore the other direction. Of course this can be evaluated for more lags (the above is just for 1 lag).
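A rough sketch of the single-lag measure for already-discretized series follows. It uses simple frequency counts as probability estimates; the function name, the binary toy series, and the sample size are assumptions of the example rather than anything prescribed above.

```python
import random
from collections import Counter
from math import log2

def transfer_entropy(source, target):
    """Single-lag transfer entropy from source to target, in bits (plug-in estimate)."""
    # Each triple is (x[t+1], x[t], y[t]) with x = target and y = source.
    triples = list(zip(target[1:], target[:-1], source[:-1]))
    n = len(triples)
    c_xxy = Counter(triples)
    c_xy = Counter((x0, y0) for _, x0, y0 in triples)
    c_xx = Counter((x1, x0) for x1, x0, _ in triples)
    c_x = Counter(x0 for _, x0, _ in triples)
    te = 0.0
    for (x1, x0, y0), c in c_xxy.items():
        p_full = c / c_xy[(x0, y0)]        # p(x[t+1] | x[t], y[t])
        p_self = c_xx[(x1, x0)] / c_x[x0]  # p(x[t+1] | x[t])
        te += (c / n) * log2(p_full / p_self)
    return te

# Toy data (hypothetical): y is a fair coin, x simply copies y with a one-step delay.
rng = random.Random(0)
y = [rng.randint(0, 1) for _ in range(5000)]
x = [0] + y[:-1]

te_y_to_x = transfer_entropy(y, x)  # close to 1 bit: y[t] fully determines x[t+1]
te_x_to_y = transfer_entropy(x, y)  # close to 0: x says nothing new about y[t+1]
```

Real return series would first have to be partitioned into symbols (for example by sign or by quantile), and the number of bins strongly affects how many samples are needed.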

Finally, one needs to consider the level of significance for a given transfer entropy, to understand at which point there is no further relationship when looking at past lags or other variables. The approach taken is to measure the baseline entropy in a shuffled series (one that removes the correlations but maintains the symbols and marginal frequencies).
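The shuffled-baseline idea might be sketched as below. Shuffling only the source series, averaging over a fixed number of shuffles, and reporting an empirical p-value are choices made for this illustration; `effective_te`, `n_shuffles`, and `seed` are hypothetical names.

```python
import random
from collections import Counter
from math import log2

def transfer_entropy(source, target):
    """Single-lag transfer entropy from source to target, in bits (plug-in estimate)."""
    triples = list(zip(target[1:], target[:-1], source[:-1]))
    n = len(triples)
    c_xxy = Counter(triples)
    c_xy = Counter((x0, y0) for _, x0, y0 in triples)
    c_xx = Counter((x1, x0) for x1, x0, _ in triples)
    c_x = Counter(x0 for _, x0, _ in triples)
    return sum(
        (c / n) * log2((c / c_xy[(x0, y0)]) / (c_xx[(x1, x0)] / c_x[x0]))
        for (x1, x0, y0), c in c_xxy.items()
    )

def effective_te(source, target, n_shuffles=50, seed=0):
    """Measured TE minus the mean TE of shuffled sources, plus an empirical p-value."""
    rng = random.Random(seed)
    measured = transfer_entropy(source, target)
    shuffled = list(source)  # same symbols and marginal frequencies
    baseline = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # destroys the temporal link to the target
        baseline.append(transfer_entropy(shuffled, target))
    mean_baseline = sum(baseline) / n_shuffles
    p_value = (1 + sum(b >= measured for b in baseline)) / (1 + n_shuffles)
    return measured - mean_baseline, p_value

# Toy data (hypothetical): x copies y with a one-step delay.
data_rng = random.Random(1)
y = [data_rng.randint(0, 1) for _ in range(2000)]
x = [0] + y[:-1]

eff_y_to_x, p_y_to_x = effective_te(y, x)
eff_x_to_y, p_x_to_y = effective_te(x, y)
```

A link would be kept only when the effective value is clearly positive and the p-value is small; the thresholds are a judgment call.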

This approach is much more robust than Granger causality if the data set one is working with has non-linearities.


### 8 responses to “Transfer Entropy”

1. Would you know how this approach compares with extended Granger (for nonlinear time series) and Bayesian networks or Probabilistic Graph Models?

• tr8dr

Good questions. To be honest I am not familiar with a specific extended Granger form (maybe you can point me to a reference). I have seen a number of extensions to VECM / Granger. Granger in my view is just testing for dependence by adjusting a VECM-like model. The F-test could be used in the context of any model, linear or non-linear. One would incrementally remove or add cross-terms from the model to see whether the cross-terms add statistical significance.

Both Granger and transfer entropy tests can be used to generate a graphical model. Iterating through a set of factors, you can test “causality” or information transfer. When you detect a link above a threshold in strength, add an edge between the factor and the dependent variable.

If one has a model that gets pretty close to the true underlying dynamics, it is definitely superior to the probability-based approaches, as it provides more information than just the relationship. The transfer entropy approach, however, can tell one how much information on a % basis is coming from a source, and the timing of it. I suppose you could use eigenvalues to rank the coefficients of the VECM model as well …

2. Hi,
good to see some information theory getting some coverage from quants/traders! 🙂

Something like what you describe has been covered in some papers by Molgedey and Ebeling. Their goal was to discover local order and predictability in time series using entropy. They also covered what an optimal partitioning of the return series into symbols is.

I implemented the methods of one of their papers in my blog.

• tr8dr

Thanks, took a brief look at the papers and your blog, looks interesting. Do you use Transfer Entropy in your work or other entropy measures?

With TE, one approach to sampling artifacts (particularly the ratio of number of variables to samples) is to determine a baseline “uncorrelated” TE and subtract it from the calculated value. Typically this is calculated with randomly shuffled independent series, i.e. effective_entropy = TE_measured - TE_shuffled. What bothers me about the approach is that it is not repeatable and does not guarantee the absence of correlation. Not sure if you use TE and whether you know of a better approach for removing the artifact entropy.

3. Ahmed

I tried to use transfer entropy to build a causality map, but I still have a problem choosing the significance level that would allow us to decide whether there is a causal relationship between two variables or not. I think this is complicated because the value of transfer entropy is not bounded like correlation, for example.
I hope that you can help with this problem, thanks a lot.

• tr8dr

Hi, I’ve been away from this topic for some time due to other things on the agenda. The entropy-based approach certainly has difficulties. You’ll do better by trying to determine an approximate embedding / phase-space-based equation that relates 2 or more assets. Determine the MLE parameters for the equation and then determine, based on various statistical tests, whether the system has statistical significance.