Causality in observational data

In the past had created clusters on assets to help identify relationships across assets.   The resulting graphs were useful in identifying assets for use in mean-reverting portfolios.   A difficulty in the approach was always around accurately measuring the strength of relationships between asset pairs.   I had looked at Granger-causality, which is fairly limited in that expects asset relationships to follow a VECM-like model, and eventually settled on weighting across a number of techniques as an approximate.

Determining causality for more general relationships requires a very different approach, where Y ← f(X) + ε.   i.e. f(x) may be an (unknown) non-linear function on X.   I came across an interesting paper: “Distinguishing cause from effect using observational data: methods and benchmarks” (link) which builds on work around looking at the asymmetry of the “complexity” of p(Y|X) versus p(X|Y), where if  Y ← X (X causes Y), p(Y|X) will tend to have lower complexity than p(X|Y).

The paper provides results on two methods: Additive Noise Methods (ANM) and Information Geometric Causal Inference (IGCI), where ANM generally did better across a variety of scenarios than IGCI.

The algorithm for ANM in python-like pseudo-code (taken from the paper):

def causality(xy: DataFrame, complexity: FunctionType, complexity_threshold: float):
    n = nrows(xy)
    xy = normalize (xy)
    training = sample(xy, n/2)
    testing = sample(xy, n/2)

    ## regress Y <- X and X <- Y on training set
    thetaY = GaussianProcess.regress (training.y, training.x)
    thetaX = GaussianProcess.regress (training.x, training.y)

    ## predict Y' <- X' and X' <- Y' & compute residuals on testing set
    eY = testing.y - GaussianProcess.predict (testing.x, thetaY)
    eX = testing.x - GaussianProcess.predict (testing.y, thetaX)

    ## determine complexity of each proposition on variable and residuals
    Cxy = complexity (testing.x, eY)
    Cyx = complexity (testing.y, eX)
    ## determine whether Y <- X, X -> Y, or indeterminate
    if (Cyx - Cxy) > complexity_threshold:
        return X_causes_Y
    elif: (Cxy - Cyx) > complexity_threshold:
        return Y_causes_X
        return None

The paper recommended using a scoring of the Hilbert-Schmidt Independence Criterion (HSIC) as the complexity test. I will have to code the above up and see how it does on a broad set of assets. The paper did test the CEP benchmark which includes some known financial relationships.


Am quite busy now with another investigation, but will revisit this with a proper implementation.



Filed under strategies

3 responses to “Causality in observational data

  1. Gary

    Was going to do something similar but I fear the noise levels may be too high

    • tr8dr

      Could well be. However, 3 of the tests in the CEP suite are on financial returns. I suspect these are daily returns, so a bit more smoothed out relative to intra-day. Would not expect to get a good read on intra-day as leaders and laggers can switch fairly easily.

      I should implement this and see how well it does. Quite busy at the moment so will have to get to it later. The post was more of a reminder for me to investigate more thoroughly later …

  2. Pingback: The Whole Street’s Daily Wrap for 12/26/2014 | The Whole Street

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s