## Embedding

One of my strategies uses a nearest neighbor approach on a set of multivariate vectors.   I had not formalized the process of choosing the variables / composing the vectors in the model.  It turns out that chaos theory provides a strong grounding and closely relates in that both use near-neighbor analysis extensively.

Given a multivariate timeseries (say 2 variables x,y), is there some “embedding” of: such that Xt+1 can be predicted by a stationary or locally stationary mapping θ based on the prior embedding tuple: This formalism is used in non-linear dynamics to reconstruct the phase space of chaotic dynamic systems that have periodicity around an attractor.   In finance there are multiple attractors:

1. the mean is an attractor
In the case of returns, 0 tends to be the attractor
2. long run elliptical distribution
Short-term observed distributions come in various forms, but tend to revert to elliptical over longer periods.

With embedding one needs to determine the variable set, the number of lagged values (the embedding dimension m) and the embedding lag τ.  The idea with the embedding is that it provides exactly enough information to construct the reproducing function.

Here is an example of the Rossler attractor.  Looking at the samples of one of its dimensions (X) we see the following timeseries: We are able to reconstruct the phase space with an embedding of 3 and optimal lag of 8: In practice the trajectory of the embedding should have the following attributes

1. spreads out the trajectory smoothly through its phase space hypercube
2. keeps periodic trajectories close to each other in space
3. Most importantly reduces the entropy in its phase space

The optimal choice of embedding makes the mapping of θ all the more effective.   I won’t go into all of the theory of the above, would involve a text.   However, there are some problems to be solved.    How does one choose an optimal embedding (dimension and lag) and what about the mapping function?

Standard Approach to Determining Optimal Embedding
Surprisingly the standard approach seems to be rather ad-hoc.   The “false nearest neighbor” algorithm is usually used to determine the embedding dimension.  The algorithm is applied to successively higher dimensions to find the minimum “false neighbors”.   The nearest neighbor for a given point in time on a trajectory should be spacially within some epsilon, otherwise is considered not to be a true neighbor.

Rather than go into what that means and the algorithm, the problem is that one has to choose appropriate epsilon and also this only optimises one variable of the problem (embedding dimension).   Tau is not completely orthogonal to embedding dimension.

Tau in turn is traditionally computed using mutual information or correlation dimension analysis.   With precautions one can use these, but is not a good approach if automation is required.

Information Theoretic Approach
Gautama, Mandic, and Van Hulle provided a new approach to selecting optimal embedding dimension m and lag τ which optimises these together and with less subjective measures.   Their approach uses the Kozachenko-Leonenko estimate of differential entropy: where the |·| portion is the euclidean distance between the ith vector and its nearest neighbor.   Ce is the Euler constant 0.5772.

The notion is that the optimal choice of dimension and lag will have minimal entropy.   We can parameterize as H(x,m,τ) to provide a measure on the data for differing dimension and lag.

Finally due to bias introduced with increasing dimension in the context of autocorrelation, the value needs to be normalized with an average of whitened (randomly shuffled) series. In their implementation they produce K=5 whitened series and take the average.

Ultimately the measure is as follows: This function can then be minimized for m and τ.

Mapping Function
I will discuss mapping functions at a later date.   For noisy data, the most common approach is to apply a polynomial basis across the X[t-1] near-neighbor successors to arrive at a next period estimate.

Filed under strategies

## Causality & Transitivity

I have a number of different graphs of causality relationships amongst equities (one measure I use is granger based).   I was interested in knowing to what extent if  A ↔ B ↔ C implies a weak A ↔ C, where am using ” ↔” to indicate granger causes in either or both directions.

In most cases the spearman correlations of A ↔ B are high, so thought as a very rough analog to observe the falloff in correlation as one progresses farther along the relationship path.   So if one has a graph of “directs” from A with associated correlations to A, what are the correlations of  the directs directs back to A, and so on:

• A
ρ(A,A) = 1
• A ↔ {B,C,D}
ρ(A, {B,C,D}) =  0.84, 0.76, 0.90
• B ↔{E,F}, C↔G, D↔{H,I,J}
ρ(A, {E,F,G,H,I,J})  = 0.62, 0.54, 0.59, …

Well, it turns out that the correlation falls off dramatically.  This does not prove anything directly about causality, but seems to imply that  transitivity is limited or near non-existant or at least falls off very quickly.

Here is an analysis done for 2 different equities:  Filed under strategies

## Graph Visualization

I am looking for a good application or library for graph (as in network) visualization.  I need to be able to visualize very large graphs so the static graph rendering approach does not work.   Want something dynamic where can focus on local context while navigating the graph.  Something a bit more sophisticated than this for instance.    I’d like to have a notion of weight and direction on the graph.   Ideally should be able to take graphml as input.

Ran across some visualization libraries:

1. Protoviz (beautiful rendering and graph ideas)
It looks like a bit of work, but with some code could probably use this library as a basis for interactive exploration.   Perhaps too much work for my busy schedule.  There are a number of examples that give me ideas for other visualizations.
2. JUNG (I already use this for graph analysis, but not for visualization)
Visualization is not terribly pretty but functional.   Is not dynamically oriented.
3. ThinkMap
These guys were one of the first I saw on the web (maybe 10 years ago) with a dynamic graphing technology.  It was flash based at the time.  Now it looks like they are using Java.   Strangely, their java UI is much inferior to their prior flash API.
4. Flare (Flex / Flash based API)
The graph layouts appear to be dynamic, but require quite a bit of code.  The code is actionscript based.   Hmm, don’t want to go there.

I’m sure there are others.   Just some random ones ran across …

Shane Conway pointed me to his rwebvis package for R that provides a R interface to rendering with Protoviz (amongst other web based renderers).   Looks like it is just what I need (thanks)!   Protoviz requires some programming, however the level of abstraction seems about right for something this flexible.

Filed under strategies

I was expecting some differences in the profiles for differing equities.  Some of the factors that may provide differences:

1. Liquidity
2. Domestic or Foreign
3. Institutional vs Individual Investor Holdings
4. Industry?

I chose Apple (AAPL) and 2 other stocks at random with liquidity / trading volume declining from left to right { AAPL, ARO, CHT }.  Here are the Up Day charts for the high and low:  Of course looking at just one asset may not give a good read.   We may want to run on pools of stocks for a given liquidity and industry to give a better overall read.

Filed under strategies

## Execution Profile

I have a number of models that forecast a next period return for a portfolio of assets.   It is important to have a view on how the asset trades on average over the period.  On average:

1. where are the lows & highs made
2. how much time is spent above or below the close (useful to know if close is your target)
3. what is the average volume profile for the asset

There is a large business on wall street built around executing large size for clients to a target measure (VWAP is an example of this).  Execution algos focus on  how to execute size with the least amount of market impact and at the lowest cost (best price & transaction cost).

I don’t have experience in the execution algo side, having focused mostly on signal generation (so if there any readers with suggestions please let me know).    That said, here are some simple analyses I did on 1 year of tick data across 2700 US equities;  Definitely gives me a better picture.

Up Day
Supposing our model predicts that an asset will close higher by a certain % on the next day and we enter at a price close to the prior close.   We now want to know when is the best point(s) to sell the position.   If the position has size, we may have to sell at multiple points during the day.

The following graph looks at the time of day on average where the high is reached, for different close-to-close % return scenarios (neutral indicates a close close to 0%, between +/- 0.3%):

For very sharp close-to-close appreciation (> 2% return), the high is usually achieved in the last hour.  In the neutral case we see the high achieved in the first 1/2 of the day, with the most weight around 10:30.   The medium returns (≤ 2%) seem to be more split between the opening and closing sessions (but otherwise pretty uniform). Interestingly, we would not want to be selling near the open (9:30 – 10:30) for a high return target.   The other return profiles show elevated probability of low in the morning session, but otherwise close to a uniform distribution.

Down Day
Supposing our model predicts that an asset will close lower by a certain % on the next day and we short at a price close to the previous close.   We are now looking for the right time to close out the short. The profile looks very similar to our Time Of Low (Up Day) graph above.  As expected the short side mirrors the long side. % Of Time Above (Below) Close
If the closing price is our target then it is useful to know how much time and perhaps # of opportunities at or exceeding target. The extreme returns (> 2% in olive) and (< -2% in green) show 30-40% probability that the asset only spends 20% of the trading day on or above (below) the close, peaking in probability at 7-8% or maybe 1/2 hour of the day.   This is coincident with the time-of-high (low) behavior that we see above as well.

We can also look at the number of “touches” on or above the closing price we realize.   This distribution seems to be more uniform.

Conclusion & Further Work
I suspect that there are differences in price movement profiles for different groups of assets.   For instance the lower liquidity stocks with less frequent trading may have a very different profile.   I know that execution algo desks measure volume and other indicators on a per-asset basis, maybe with some grouping overlays.

Another analysis that would be useful would be to see volume-weighted price profiles.   For example, what is the average volume weighted price path for differing closing return targets.

Filed under strategies

## Tale of 2 Strategies

As mentioned in the previous post, the market of the current year has been harder for some mean-reversion (and probably trend following) strategies.     I have a family of non-parametric adaptive MR strategies for baskets of equities.

I had been trading a strategy this year (on and off)  that declined in growth substantially from prior years due to the changes in the market.  Here is the 3 year cumulative return (note that this is not reinvested return, rather trading a fixed amount of capital): The strategy profitability has grown much more slowly since late 2009.   Here is a blowup of the YTD where you can see that it has done about 20% YTD: The strategy appears to be more appropriate for trending markets rather than the sideways market we’ve been in.   I have another MR strategy with less spectacular 3 year returns, but appears to fit this market better with fairly smooth returns YTD and more rugged returns in the higher volatility regions (2008) in the past: A blowup of the YTD trajectory shows a 80% gain YTD: The volatility at the beginning of the year was high with this strategy.   The question becomes when to switch from one strategy to another or how to adjust allocation weights on the strategies, trading both.

Filed under strategies

## Mean Reverting Strategies

Recently there have been some discussions on one of the quant groups, where someone indicated “mean reverting strategies no longer work”.    I don’t agree with this at all.    However, there has been a lot of risk aversion and the market movements of the last few months (well since Feb or March) have been quite different then those preceding.    The market of the last few months has been ideal for market making given the tight bands and much less favorable for strategies that rely on trends or big swings.

One of my strategies is quite simple and depends on mean-reverting behavior across a large portfolio.   When MR behavior in the market is “cranked back” it makes sense to filter the portfolio, increasing the chance of successful trades and reducing possible losses from assets without the desired footprint.

I decided to do some tests on the biggest winners and losers by asset to confirm my belief that wining assets would exhibit strong negative autocorrelation and losing assets would likely exhibit positive or statistically not-significant levels of autocorrelation (+/-).  Indeed one of the biggest losing assets had the following profile: The above asset spends most of the time in positive auto-corr territory and does not tend to mean-revert over the window of interest.    The asset below spend most of its time in negative auto-corr territory and hence is very strongly mean-reverting: Of course measures like this have lag associated with them, but if chosen carefully can be used to effectively filter assets on the run.   Here is the test code in R: Filed under strategies

## Evaluating High Dimensional Expectation

Here I am going to outline a simple approach to evaluating a high dimensional discrete expectation on empirical data.   For example, if one considers the evaluating an expectation like: Let’s say X is of dimension 10 and we have a 10000 x 10 matrix of observations.   Furthermore each of the 10 variables can take 5 possible states:

X0 = { 1,4,4,1,1,1,1,3,3,2 }
X1 = { 1,2,2,2,1,1,1,5,5,2 }
….

The total number of possible distinct state vectors is 5^10 or over 9 million.   A naive approach to the above integral is to consider all possible sequences of X, but this would be enormously expensive and will not scale as the dimension or number of individual states grows.

The distribution is going to be sparse.  We know this because 5^10 >> 10,000 observations.   The worst case algorithm would involve a scan for a given pattern, calculating the frequency.   This poor approach would have complexity  O(N S^d), where N is the number of samples, S is the number of distinct symbols, and d is the dimension.

Some simple observations:

1. It makes sense to restrict the domain of the integral to existant sequences (get rid of complexity S^d)
2. It makes sense to avoid multiple passes, where naively there would be a pass for each frequency calculation

Since we are dealing with a finite sample and known symbols there is an approach that is O (N log N).   We create a binary or n-nary tree ordered lexically by symbol sequence.   Iterating through the samples, for each sample locate or insert a node in the tree, tracking its frequency.   Increment frequency. The tree is rearranged as necessary to create balance and reduce the average path length to log N.  In one pass we now know both the domain of the distribution and the frequencies.   The expectation is now straight-forward to evaluate.

Note that instead of using a tree, one can use a hash-table keyed off of sequences.   This will result in an O(N) algorithm.   For some applications, though, it is useful to have a tree that places neighbors close to each other (for instance if we want to run a near neighbor analysis).

Filed under strategies

## Multivariate Independence Tests

Thought I should mention the obvious multivariate independence tests.   These can be used in a 1st pass to trim out the obviously independent variables.   Assume a d-dimensional joint distributions p(x0, x1, x2, … xd) as a starting point with sample vector X = { x0, x1, x2, .. xd }.

Let us consider a subdividing of the above d dimensional distribution into n and m dimensional distributions respectively (where d = n+m).  We want to test to see whether the marginal distribution of n variables is independent of the distribution of m variables.   Our sample vector for the full joint distribution can be expressed as:

X = { {a1,a2, ..an}, {b1,b2,…,bm}}

and we have marginal distributions p(A) and p(B) each of which is a subset of the dimensions of p(X).   The covariance matrix for p(X) looks like: Intuitively if A and B are independent then Σab should be 0 (well not significantly different from 0).   We set up the null hypothesis H0 to indicate independence.   The wilks’ test rejects independence if: where N is the number of samples.   This test is valid even for non-elliptical distributions, so is a good 1st line test.   The intuition with the above measure is that the determinant ratio is measuring the volume of the covariance matrix in ratio to the volumes of the non-crossing components of covariance.   If the volume is mostly explained by the non-crossing components, then A and B are independent.

Filed under strategies

## Finding Effective Sub-distributions

I didn’t get around to talking about dimensionality reduction in the previous post.   I have a number of related problems I am interested in solving.    Setting up the problem, let us say that we have a distribution p (E | C), where both E and C are sets of the same dimension (d):

E = {e1, e2, e3, … e[d]}
C = {c1, c2, c3, … c[d]}

d is a high dimension >> 1000.

Problem 1: Sub-distributions with C and E of same dimension
In the case where many of the variables in the dimension are independent, the joint probability of  a specific 1000 variable state, will often be much less than the probability of a lower dimensional state variable (of say 20 variables with covariance).

The question is then, what dimensional subsets of this distribution are the most predictive.   Another way to state this is what subsets of C provide the most information transfer into  E.

It would be then of interest to look at the top N subsets of the distribution that carry the most information from C → E.

Problem 2: Sub-distributions with C and E of different dimension
As in problem 1, we look for subsets of C that carry the most predictive information for subsets of E.   In this case, though, C and E need not be of the same dimension.  Combinatorially this is a harder problem than problem #1, however the math is the same.

Outline of Approach

1. spanning graph structure is determined between variables to avoid astronomical combinations
2. transfer entropy analysis is done on edges in spanning structure
3. subgraphs of depth 1, 2, 3, … are evaluated from a transfer entropy point of view to determine the power of each subgraph
4. subgraphs are sorted by transfer entropy

Alternatives
There are undoubtedly many approaches to finding factors, particularly with classification problems where E is of small dimension (often 1 dimension).   I’d be grateful for alternative suggestions.