Causality in observational data

In the past had created clusters on assets to help identify relationships across assets.   The resulting graphs were useful in identifying assets for use in mean-reverting portfolios.   A difficulty in the approach was always around accurately measuring the strength of relationships between asset pairs.   I had looked at Granger-causality, which is fairly limited in that expects asset relationships to follow a VECM-like model, and eventually settled on weighting across a number of techniques as an approximate.

Determining causality for more general relationships requires a very different approach, where Y ← f(X) + ε.   i.e. f(x) may be an (unknown) non-linear function on X.   I came across an interesting paper: “Distinguishing cause from effect using observational data: methods and benchmarks” (link) which builds on work around looking at the asymmetry of the “complexity” of p(Y|X) versus p(X|Y), where if  Y ← X (X causes Y), p(Y|X) will tend to have lower complexity than p(X|Y).

The paper provides results on two methods: Additive Noise Methods (ANM) and Information Geometric Causal Inference (IGCI), where ANM generally did better across a variety of scenarios than IGCI.

The algorithm for ANM in python-like pseudo-code (taken from the paper):

```def causality(xy: DataFrame, complexity: FunctionType, complexity_threshold: float):
n = nrows(xy)
xy = normalize (xy)
training = sample(xy, n/2)
testing = sample(xy, n/2)

## regress Y <- X and X <- Y on training set
thetaY = GaussianProcess.regress (training.y, training.x)
thetaX = GaussianProcess.regress (training.x, training.y)

## predict Y' <- X' and X' <- Y' & compute residuals on testing set
eY = testing.y - GaussianProcess.predict (testing.x, thetaY)
eX = testing.x - GaussianProcess.predict (testing.y, thetaX)

## determine complexity of each proposition on variable and residuals
Cxy = complexity (testing.x, eY)
Cyx = complexity (testing.y, eX)

## determine whether Y <- X, X -> Y, or indeterminate
if (Cyx - Cxy) > complexity_threshold:
return X_causes_Y
elif: (Cxy - Cyx) > complexity_threshold:
return Y_causes_X
else:
return None

```

The paper recommended using a scoring of the Hilbert-Schmidt Independence Criterion (HSIC) as the complexity test. I will have to code the above up and see how it does on a broad set of assets. The paper did test the CEP benchmark which includes some known financial relationships.

Am quite busy now with another investigation, but will revisit this with a proper implementation.

Filed under strategies

3 responses to “Causality in observational data”

1. Gary

Was going to do something similar but I fear the noise levels may be too high

• tr8dr

Could well be. However, 3 of the tests in the CEP suite are on financial returns. I suspect these are daily returns, so a bit more smoothed out relative to intra-day. Would not expect to get a good read on intra-day as leaders and laggers can switch fairly easily.

I should implement this and see how well it does. Quite busy at the moment so will have to get to it later. The post was more of a reminder for me to investigate more thoroughly later …