# Embedding

One of my strategies uses a nearest neighbor approach on a set of multivariate vectors.   I had not formalized the process of choosing the variables / composing the vectors in the model.  It turns out that chaos theory provides a strong grounding and closely relates in that both use near-neighbor analysis extensively.

Given a multivariate timeseries (say 2 variables x,y), is there some “embedding” of:

such that Xt+1 can be predicted by a stationary or locally stationary mapping θ based on the prior embedding tuple:

This formalism is used in non-linear dynamics to reconstruct the phase space of chaotic dynamic systems that have periodicity around an attractor.   In finance there are multiple attractors:

1. the mean is an attractor
In the case of returns, 0 tends to be the attractor
2. long run elliptical distribution
Short-term observed distributions come in various forms, but tend to revert to elliptical over longer periods.

With embedding one needs to determine the variable set, the number of lagged values (the embedding dimension m) and the embedding lag τ.  The idea with the embedding is that it provides exactly enough information to construct the reproducing function.

Here is an example of the Rossler attractor.  Looking at the samples of one of its dimensions (X) we see the following timeseries:

We are able to reconstruct the phase space with an embedding of 3 and optimal lag of 8:

In practice the trajectory of the embedding should have the following attributes

1. spreads out the trajectory smoothly through its phase space hypercube
2. keeps periodic trajectories close to each other in space
3. Most importantly reduces the entropy in its phase space

The optimal choice of embedding makes the mapping of θ all the more effective.   I won’t go into all of the theory of the above, would involve a text.   However, there are some problems to be solved.    How does one choose an optimal embedding (dimension and lag) and what about the mapping function?

Standard Approach to Determining Optimal Embedding
Surprisingly the standard approach seems to be rather ad-hoc.   The “false nearest neighbor” algorithm is usually used to determine the embedding dimension.  The algorithm is applied to successively higher dimensions to find the minimum “false neighbors”.   The nearest neighbor for a given point in time on a trajectory should be spacially within some epsilon, otherwise is considered not to be a true neighbor.

Rather than go into what that means and the algorithm, the problem is that one has to choose appropriate epsilon and also this only optimises one variable of the problem (embedding dimension).   Tau is not completely orthogonal to embedding dimension.

Tau in turn is traditionally computed using mutual information or correlation dimension analysis.   With precautions one can use these, but is not a good approach if automation is required.

Information Theoretic Approach
Gautama, Mandic, and Van Hulle provided a new approach to selecting optimal embedding dimension m and lag τ which optimises these together and with less subjective measures.   Their approach uses the Kozachenko-Leonenko estimate of differential entropy:

where the |·| portion is the euclidean distance between the ith vector and its nearest neighbor.   Ce is the Euler constant 0.5772.

The notion is that the optimal choice of dimension and lag will have minimal entropy.   We can parameterize as H(x,m,τ) to provide a measure on the data for differing dimension and lag.

Finally due to bias introduced with increasing dimension in the context of autocorrelation, the value needs to be normalized with an average of whitened (randomly shuffled) series. In their implementation they produce K=5 whitened series and take the average.

Ultimately the measure is as follows:

This function can then be minimized for m and τ.

Mapping Function
I will discuss mapping functions at a later date.   For noisy data, the most common approach is to apply a polynomial basis across the X[t-1] near-neighbor successors to arrive at a next period estimate.

Advertisements

9 Comments

Filed under strategies

### 9 responses to “Embedding”

1. Very nice. Are you using RTisean for this analysis?

• tr8dr

Actually I was using tseriesChaos by the same author (I’m not sure why he has 2 diff packages with overlapping functions).

RTisean looks to have a few more functions.

Either of these is a good start for playing around with the problem, but ultimately have to write code outside of R due to performance considerations.

The algs presented in these 2 packages are fairly basic.

2. tr8dr

@Shane. Actually I know why this package is being rewritten. RTisean is based on the external TISEAN application (did not recognize that at first).

TISEAN looks to have some good algorithms (for surrogates for instance) but is a bit outdated in other areas.

Unfortunately these algorithms are not at all straightforward and still an active area of research. So have to guess the progress for the tseriesChaos package will be a slow.

3. That’s the “problem” with living on the bleeding edge. 🙂

4. Anton

Hi

Can you explain why did you normalize the differential entropy (the autocorrelation issue)?

5. Scott Locklin

I had sort of looked at TSEAN a few times, but was never impressed with what I saw there. Thanks for reminding me to look again.
I suspect you’re making it more complex than it needs to be. I have pretty good luck with just plain “Brute searching” for lag + embedding dimension. My NN gizmo is extremely fast, though.
I’m really enjoying your blog; thanks for sharing all this nice stuff.

• tr8dr

Actually I decided against using the above approach. Found the whitening to be problematic or overly complex to do properly (there are approaches that bring the data into the freq domain and whiten there — gets complex quickly).

Have an approach that uses a variant of FNN that I wrote to determine.

With your brute force approach, how are you measuring the effectiveness of the embedding? Are you using an information measure or just evaluating your predictions for a given embedding against the observed?

I considered brute force, but then the number of variables and observations I have is large. It may work out better than the other approaches though, so curious.

• Scott Locklin

FNN -I had not heard of this. The dude took the same approach I did, but he apparently didn’t figure out how to make libANN really go fast. I was going to port my stuff there, but maybe his gizmo is good enough to get work done when I must live in R land.

For now, I’m just using my “goodness of whatever” measure (MAPE, fraction same sign, whatever) to see if I am doing well. There might be bias issues, but the last time I looked at this, just throwing lots of lags at the problem works pretty well. I might give your approach some attention at some point, but I have a lot of other stuff I want to try.

Also: consider local regression on your NN neighborhoods. It ain’t real expensive, and you can sometimes get more signal. If you’re using libANN, you’re getting the distances for free also, and so you can weight the local regression.

TSEAN is *old* -we’re talking back when the Santa Fe guys were cutting edge of awesome. Of course, using this trick is kind of hearkening back to those days. I never really understood why people don’t talk about it more. Hopefully because they’re too busy making money.

• tr8dr

re: TSEAN, agree. Of the 2 R packages found for the non-linear stuff, one uses TSEAN and the other is a port of some of its functionality. TSEAN at this point looks quite dated. It is also not very useful to me as would need to integrate into my development.