# Monthly Archives: November 2009

## Duration Estimation

In a prior post I mentioned that for intra-day variance prediction it makes sense to separate variance into two processes:

1. intensity process
When is the next event going to occur? Let's call this Tnow + Δt.   This is the more complex of the two processes to predict.
2. power process
What is the amplitude of the event at time Tnow + Δt?   The power or amplitude process seems to be fairly well behaved.   An ARMA-style process seems like a likely candidate.

Towards this end, I have been exploring models for the intensity process.   Very often this is modeled in terms of duration.   Below is a summary of some results:

ACD Models
ACD processes make overreaching assumptions.  In particular, ACD models assume a constant AR decay and innovation contribution across time.   Unfortunately this is not supported by empirical observations.   Here are some results for the best-fitting Weibull ACD model on HF data:

The R^2 level of 0.0091 does not inspire confidence.

SVR Model
I used an iterative non-parametric machine learning approach (SVR) with a training set of 20 prior observations and a lagged series of the derivatives of the prior 20 durations as the input vector.   Training across the entire series, one gets an in-sample prediction R^2 of 0.9980.   Unfortunately, incremental out-of-sample prediction does not fare as well:
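The setup described above can be sketched with scikit-learn's SVR; the window size is from the post, but the kernel, regularization, and the synthetic data here are illustrative assumptions, not the parameters actually used:

```python
import numpy as np
from sklearn.svm import SVR

def make_features(durations, window=20):
    """Build (X, y) pairs: each input vector is the lagged first
    differences (derivatives) of the prior `window` durations."""
    d = np.diff(durations)  # discrete derivative of the duration series
    X, y = [], []
    for i in range(window, len(durations) - 1):
        X.append(d[i - window:i])
        y.append(durations[i + 1])
    return np.array(X), np.array(y)

# synthetic durations for illustration only
rng = np.random.default_rng(0)
durations = rng.exponential(scale=1.0, size=500)

X, y = make_features(durations, window=20)
model = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X, y)
in_sample_r2 = model.score(X, y)
```

Out-of-sample evaluation would instead refit (or update) on a rolling window and score only the one-step-ahead predictions.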

Distribution of Durations
Here are 2 views on the distribution of durations:

Alternative Models
Some possibilities:

1. markov chain (probabilistic state system)
We model the patterns by categorizing the durations into K separate levels.   To train, we observe the chain of states, say {K1, K8, K1, K1, K1, K4}, and determine a graph describing the approximate event chains, factorizing and assigning probabilities.
2. ANN
Use a simple feed-forward network, trained with a GA or DE.   This is easy to implement but subject to a variety of problems such as overfitting.

As the ANN is easy to compose, I will start there.
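For reference, the Markov chain idea above can be sketched as follows; the quantile-based binning and K = 8 levels are assumptions for illustration:

```python
import numpy as np

def transition_matrix(durations, k=8):
    """Quantize durations into k levels (by quantile) and estimate the
    Markov transition probabilities between successive duration states."""
    edges = np.quantile(durations, np.linspace(0, 1, k + 1)[1:-1])
    states = np.digitize(durations, edges)           # state labels 0..k-1
    counts = np.zeros((k, k))
    for a, b in zip(states[:-1], states[1:]):        # count observed transitions
        counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1                              # guard against empty rows
    return counts / rows                             # each row sums to 1

rng = np.random.default_rng(1)
P = transition_matrix(rng.exponential(size=1000), k=8)
```

Prediction then amounts to reading off the most likely next state (or the expected duration) from the row for the current state.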

## Durations on Intraday Price Series

As mentioned in a previous post, I intend to model quadratic variation in terms of multiple pairings of intensity (duration) and return level processes.   At a minimum I want a pairing for “non-jump” related returns and a pairing for “jump” related returns.

To do this it is necessary to partition returns into categories based on a threshold.   We may further want to disregard price movements below a certain level unless they cumulatively add up to a return of significance within a period.   Towards this end, my duration measurement function uses a threshold to determine whether a return is to be considered an event or not.  In pseudocode:

```
r ← {0} ∪ diff(log(series))
t ← times(series)
durations ← {}
for (i in 2:length(r))
{
    # determine cumulative return since last acceptance
    cumr ← <cumulative return since last event or max cum window>

    # determine whether a qualifying event has occurred
    if (|cumr| ≥ threshold or |r[i]| ≥ threshold)
        durations ← durations ∪ {t[i] - <Tlastevent>}
}
```
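A runnable Python version of the pseudocode above; resetting the cumulative return on each accepted event is a simplifying assumption here (the original also caps the cumulative window):

```python
import numpy as np

def event_durations(prices, times, threshold):
    """Return durations between qualifying return events.  A return
    qualifies if it, or the cumulative return since the last accepted
    event, exceeds the threshold in absolute value."""
    r = np.concatenate([[0.0], np.diff(np.log(prices))])
    durations = []
    cumr = 0.0
    t_last = times[0]
    for i in range(1, len(r)):
        cumr += r[i]
        if abs(cumr) >= threshold or abs(r[i]) >= threshold:
            durations.append(times[i] - t_last)
            t_last = times[i]
            cumr = 0.0                     # reset after an accepted event
    return durations

ds = event_durations([1.0, 1.0, 1.01, 1.01, 1.02], [0, 1, 2, 3, 4],
                     threshold=5e-3)
```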

For the diffusion portion of the process, in this 2 second sampled data set (EUR/USD low-liquidity period), a threshold of 3e-5 (equivalent to about 1/2 pip) seems to work well:

The jump portion of the process should be set so as to capture desired jump features and not much more; here I show a threshold of 2e-4 (equivalent to about 3 pips):

Filed under statistics, volatility

## Anatomy of a Spike

Spikes manifest frequently in intra-day markets.   These are often short-lived and associated with buying/selling programs more often than with changes in fundamental factors, particularly in low-liquidity periods.   In evaluating duration and variance measures I was trying to determine reasonable jump thresholds.

Below is a price series demonstrating a variety of spiking behavior:

Taking a look at the region around the jump at high frequency, we note that the jump did not occur in one trade but rather across multiple trades within a short space of time:

From a duration perspective, if we want to capture the spike as one event of a given magnitude we either need to consider the cumulative return over a given window or sample with a longer period.   Here is the same with a 2 second sample:

Here is the 2 second sampling of the original range.   With the longer sample period, the spikes in return are more evident (compare to the first graph in the post):

Filed under statistics, volatility

## Rethinking Variance

Volatility (variance) estimation is probably one of the most difficult areas of research in finance.   As mentioned in previous posts, my recent research has been largely focused on duration-based models for variance estimation.

I’ve done some experimentation with duration-based models but have yet to find a satisfactory approach; however, I feel I am now beginning to understand what a satisfactory model might look like.    I don’t have the luxury of time to make this a full study, but hope to come up with an approach, however heuristic, that meets my needs.

There is a view that volatility (quadratic variation) can be separated into at least 2 distinct components:

• integrated variance (IV)
This is the primary return variation generated by the diffusive process component in our price process
• squared jump term
Jumps due to news or other behavior, not modeled by the diffusive process.

Together these make up what we call quadratic variation (QV):

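In the usual notation, the decomposition reads:

```latex
QV_t \;=\; \underbrace{\int_0^t \sigma_s^2 \, ds}_{\text{integrated variance}}
\;+\; \underbrace{\sum_{0 < s \le t} J_s^2}_{\text{squared jumps}}
```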
The fundamental notion that underlies both “jumps” and “local volatility” is the expected squared return over a given time interval (infinitesimal in the continuous case; discrete in real markets).   A pdf for the probability of a “jump” of size J at time t can be constructed as:

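A sketch of this construction, as a product of an event indicator density and a conditional size density:

```latex
f(\Delta r) \;=\; f(\Delta r \mid I, \mathcal{F}) \cdot f(I)
```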
Where f(I) is the pdf of the “jump” indicator event and f(Δr|I,F) is the conditional distribution of jump size Δr given the event.  Together they provide a posterior representing the probability of a price movement of Δr.   Variance can then be estimated with either the continuous or discrete form:

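A sketch of the two forms, consistent with the conditional distributions just described:

```latex
\operatorname{E}\!\left[\Delta r^2\right]
\;=\; \int \Delta r^2\, f(\Delta r \mid I, \mathcal{F})\, f(I)\, d\Delta r
\;\;\approx\;\; \sum_{\Delta r} \Delta r^2\, f(\Delta r \mid I, \mathcal{F})\, f(I)
```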
Thinking about “jumps” versus diffusion-based volatility (as in the local vol estimates σ^2), the two processes differ in terms of intensity.   However, we can still estimate both with the above expectation.   Both jumps and local vol can be estimated with the same approach; only the parameterization of the probability distribution need differ for different price jump levels (Δr).

One of the “hangups” in my recent attempts to model variance was in trying to model the timing and size of jumps as one process.   Now, what if we split this out:

1. model the duration (or intensity) of events in jump level range
Use an exponential-ACD model calibrated for duration between events in the range.
2. model the size of events within the range
Consider an autoregressive self-exciting process.  The AR component will model the reversion to the mean.

I can model variance as a set of self-exciting intensity processes, one for each range of price (return) levels.   I can then use the specializations of the f(I) and f(Δr|I,F) conditional distributions for each price range to compute expected squared returns:

The indicator process
The indicator (or intensity) process can be modeled with an ACD model such as the Burr-ACD (which has been shown to fit well empirically).   This can be calibrated for two sets of durations, one for “local vol” and one for “jump vol”.   Potentially we could consider sub-dividing into more levels.

The ACD process as developed by Engle and Russell specifies the duration as a mixing process:

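That is, the duration is the product of a conditional expectation term and an i.i.d. innovation:

```latex
D_i \;=\; \psi_i\, \varepsilon_i, \qquad \psi_i \;=\; \operatorname{E}\!\left[D_i \mid \mathcal{F}_{i-1}\right]
```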
The conditional density of Di is then:

The ultimate form is subject to the choice of hazard function.   The general form of the ACD update equation is reminiscent of GARCH:

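In its general form, a standard statement of the Engle-Russell recursion (with g as described below) is:

```latex
\psi_i \;=\; \omega \;+\; \sum_{j=1}^{p} \alpha_j\, g(D_{i-j}) \;+\; \sum_{j=1}^{q} \beta_j\, \psi_{i-j}
```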
Where g(D) is an analog of our hazard function for the distribution. In our case we will use the Burr distribution as proposed by Grammig and Maurer (2000).  The Burr density, survival, and hazard functions are given by:

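One common parameterization of the Burr density, survival, and hazard used in the Burr-ACD literature is the following; I state it here as an assumption, since conventions vary:

```latex
f(\varepsilon) \;=\; \frac{\mu\,\kappa\,\varepsilon^{\kappa-1}}{\left(1 + \sigma^2 \mu\, \varepsilon^{\kappa}\right)^{\frac{1}{\sigma^2}+1}},
\qquad
S(\varepsilon) \;=\; \left(1 + \sigma^2 \mu\, \varepsilon^{\kappa}\right)^{-\frac{1}{\sigma^2}},
\qquad
h(\varepsilon) \;=\; \frac{f(\varepsilon)}{S(\varepsilon)} \;=\; \frac{\mu\,\kappa\,\varepsilon^{\kappa-1}}{1 + \sigma^2 \mu\, \varepsilon^{\kappa}}
```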
We use the Burr density as the distribution ξ[i] for the indicator component ε[i] in Di = ψ[i]ε[i].  We choose a density for the distribution of ψ[i] as a gamma mixture (from Guirreri 2009):

And the distribution of the indicator process ε[i] is the Burr distribution from above.   To determine the probability of duration D[i] given ψ[i], we use ε[i] = D[i] / ψ[i]:

Addendum: having done more exploration on this, it looks like the Stochastic Volatility Duration model proposed by Ghysels, Gourieroux, and Jasiak (2004) would offer much better prediction capability.

The level process
The jump level process can be relatively simple.   We observe autoregressive decay across returns, for both diffusion-related and jump returns.   There are gaps between return events in both scenarios, but our mixture with the indicator process will allow us to model this.

A relatively simple autoregressive process can be used to model the event size decay from a self-exciting jump level for each range.   Given that observations will be made at non-uniform times, an exponential time-dependent AR model is probably best.   The distribution will need to be estimated.

Conclusion
Model volatility with two processes:

1. intensity process to represent periodicity / gapping behavior
2. magnitude process representing self-exciting AR behavior

Addendum: I discovered that others are thinking along the same lines.   There are a couple of papers circa 2008/2009 proposing models similar to what I have outlined.


Filed under strategies

I’ve been looking at a number of different volatility models for both daily and intra-day variance.   In the process I have looked at GARCH(1,1) for daily data and have developed a number of duration-based variance (DV) models for intra-day.

Daily Returns
I had concluded early on that GARCH(1,1) generally fits daily data well but fails miserably with intra-day data.    Unfortunately, for some markets, GARCH performs miserably on daily data as well.

I was looking at Canadian 2Y CMT yields over the last 10 years.   There are significant spikes in return in any given year.    For example, in 2003, on the raw daily returns, GARCH does poorly (squared returns: black, sigma squared: red):

As an experiment I used (very) lightly smoothed prices as a basis for GARCH testing.  The smoothing gave continuity to the price function and to the 1st derivatives (returns).    GARCH(1,1) calibrated on the same data set, but with minimal smoothing, yields a much higher degree of fit (25×):

Of course we are now projecting sigma^2 on a lightly smoothed price series and not the original series.   Note that the magnitude of returns has reduced as well, as the smoothing allows for more gradual transitions.

Let’s look at a smaller section of returns on the raw price series:

Visually there appears to be an autoregressive component in squared returns, but it involves jumps followed by periods of near-zero return.   Subsequent jumps appear to follow a near-linear decay in magnitude.    You will note that the last set appears to be a poor fit; however, there is a cluster of low jumps that, when considered in intensity space, would have a higher combined value (i.e., consider the cumulative return over a small window).

This behavior points to modeling this as an intensity function (equivalent to duration).

Intra-day Squared Returns
Autocorrelation plots of liquid periods and illiquid periods both show significant autocorrelation out to 10+ seconds, indicating that we should be looking at an AR process at some level with long memory:

Thus far the most successful model is a duration-based Hawkes process, which is “infinitely” autoregressive with exponential decay.
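A minimal sketch of the exponentially decaying Hawkes conditional intensity (the parameter values are illustrative, not calibrated):

```python
import math

def hawkes_intensity(t, event_times, mu=0.5, alpha=0.8, beta=1.5):
    """Conditional intensity of an exponential Hawkes process:
    lambda(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i)).
    Each past event excites the intensity, with exponential decay."""
    return mu + sum(alpha * math.exp(-beta * (t - ti))
                    for ti in event_times if ti < t)

lam = hawkes_intensity(10.0, [9.0, 9.5, 9.9])
```

The "infinite" autoregression comes from every past event contributing, with geometrically vanishing weight.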

Filed under statistics, volatility

## Calibrating Non-Convex Functions

Many of the calibrations I perform are on non-convex functions, functions with multiple local minima/maxima.    Solving a min(max)imization for this class of function precludes a wide range of algorithms based on gradient descent (such as BFGS).    There are a number of “metaheuristic” approaches to solving non-convex problems, such as: simulated annealing, genetic algorithms, differential evolution, particle swarm, and brute force grid search, amongst others.

For the last couple of years I have been using JGAP, a library for optimization via genetic algorithm.    For some time I was under the impression that most GA libraries were similar in terms of core algorithm and differed only at the API level.   I ran into some problems with the new variance model I am working on, where JGAP was not converging in the expected manner.

I recently began using an R package for ad-hoc calibrations (genoud) and found that it converges quickly, whereas JGAP did very poorly in comparison.   DEoptim is a very similar package (a straight implementation of differential evolution). Differential Evolution is a cousin of GA, but with a different approach to population evolution.

Differential Evolution
Differential Evolution is very similar to standard GA approaches in that both involve a randomly generated population of “trial vectors” (θ) and a directed evolution of those vectors towards the optimum.   Differential evolution proceeds as follows:

```
generate Population = { θ1, θ2, θ3, ... } where θ in domain of function
determine θbest ← argmax [θ : fitness(θ)] forall θ
repeat
    foreach θi in Population
        draw { θa, θb, θc } randomly from Population (without duplication)
        θd ← θa + F(θb - θc)
        θnew ← crossover(θd, θi)

        # adjust ith vector if fitness of generated vector is superior
        if fitness(θnew) > fitness(θi) then
            θi ← θnew

        # adjust best vector if new optimum seen
        if fitness(θnew) > fitness(θbest) then
            θbest ← θnew
    end
until fitness(θbest) approaches target
```

One of the key innovations is that the variance of the innovations (mutations) is dependent on the dispersion of the vectors.  At the outset the dispersion will be large, and as the optimization progresses, the dispersion will reduce and hence mutations will be more focused.   The choice of the function F(), operating on the difference of vectors, has various effects on the speed of convergence and exploration of the domain.  I will not go into it here.
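A minimal Python sketch of the scheme above (DE/rand/1 with binomial crossover; F is taken here as a scalar factor, and all parameter values are illustrative):

```python
import numpy as np

def differential_evolution(fitness, bounds, pop_size=50, F=0.8, CR=0.9,
                           generations=200, seed=0):
    """Minimal DE/rand/1/bin optimizer, maximizing `fitness`.
    bounds: list of (low, high) pairs, one per dimension."""
    rng = np.random.default_rng(seed)
    low, high = np.array(bounds, dtype=float).T
    pop = rng.uniform(low, high, size=(pop_size, len(bounds)))
    fit = np.array([fitness(x) for x in pop])
    for _ in range(generations):
        for i in range(pop_size):
            # draw three distinct population members, none equal to i
            a, b, c = rng.choice([j for j in range(pop_size) if j != i],
                                 size=3, replace=False)
            mutant = pop[a] + F * (pop[b] - pop[c])   # differential mutation
            cross = rng.random(len(bounds)) < CR      # binomial crossover mask
            trial = np.where(cross, mutant, pop[i])
            f_trial = fitness(trial)
            if f_trial > fit[i]:                      # greedy selection
                pop[i], fit[i] = trial, f_trial
    best = pop[np.argmax(fit)]
    return best, float(fit.max())

# maximize a simple concave function with optimum at (1, 2)
best, val = differential_evolution(lambda x: -(x[0] - 1)**2 - (x[1] - 2)**2,
                                   [(-5, 5), (-5, 5)])
```

Note how the mutation scale shrinks automatically: as the population clusters near an optimum, the differences θb − θc shrink with it.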

Improvements with Hybrid Approach
Provided one has a continuous function with derivatives within the neighborhood of a trial vector, the algorithm can be improved with the use of gradient descent (such as the BFGS algorithm) to locate the true optimum within a neighborhood.    Care must be taken not to optimize too early such that the diversity of trials is lost.

Some Tests
Here are some results for the two R packages mentioned above.  Ultimately I need to get an implementation into Java, but wanted to do a comparison with existing implementations first.   I ran optimizations on a particularly pathological Gaussian mixture (taken from the examples in the genoud package):

Which looks like:

It is clear that gradient descent (ascent) solvers will gravitate to one of the local maxima with the above function.   Testing both packages with a population of 100 and similar parameters, each arrived within 1e-4 of the answer within 10 generations (without any BFGS optimization) and within 5e-6 within 60 generations.

With BFGS optimization on, genoud arrived within 1e-10 within 2 generations!   Unfortunately genoud arrived at the wrong maximum on some other (more complex) mixture function, whereas the pure DE approach arrived at the proper one.    With appropriate parameterisation this can be avoided, however.

Filed under genetic algorithms

## Variance Estimators Revisited

I need to have a reasonably accurate measure of intra-day variance for a strategy I am working on.    In a previous post I indicated that while GARCH(1,1) fits daily returns well, it fits intra-day returns poorly.   In fact, so poorly that it was not possible to arrive at a non-zero maximum likelihood.

The failure of the GARCH model intra-day has several causes.   Intraday (squared) returns exhibit:

1. random large jumps followed by sharp drop-off
2. microstructure noise
3. apparent gaps in returns, at odds with a smooth AR process

GARCH Revisited
I decided to take another look at GARCH; perhaps with some modification I could make it useful as a rough measure, short of a more sophisticated model.

The “jumps” present in intra-day data have a near-zero probability under a Gaussian error model.   To combat this, I added a component to the GARCH model to absorb jump shocks:

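One plausible form consistent with this description (the exact specification is not shown in the original, so treat the following as an assumption): with jump threshold c,

```latex
\sigma_t^2 \;=\; \omega \;+\; \alpha \left( r_{t-1}^2 - J_{t-1} \right) \;+\; \beta\, \sigma_{t-1}^2,
\qquad
J_{t-1} \;=\; \max\!\left( r_{t-1}^2 - c,\; 0 \right)
```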
The added component is J, which absorbs the excess return above a threshold.   The above basically says that the jump has no predictive value for subsequent measures of the squared return (though this is not strictly what is observed).   More likely, the jumps should be considered a largely independent random process contributing to overall variance.

The threshold was determined heuristically as:

With the above formulation I was able to calibrate to a relatively high likelihood and an R^2 of around 0.90 in sample and 0.85 out of sample, that is, with truncated jumps.

Duration Based Approach
For a Brownian motion based process, we know that increments Δr over time interval Δt scale as follows, and that one implies the other:

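For standard Brownian motion with volatility σ, this is the familiar square-root-of-time scaling:

```latex
\operatorname{E}\!\left[(\Delta r)^2\right] \;=\; \sigma^2\, \Delta t
\quad\Longleftrightarrow\quad
\Delta r \;=\; \sigma \sqrt{\Delta t}\; Z, \quad Z \sim N(0,1)
```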
This basically implies that we can estimate the amount of time it will take for Δr to be realized in the price (returns) process.   Engle and Russell developed a well-regarded model for price duration, called the Autoregressive Conditional Duration (ACD) model.   The model expresses price movements in terms of duration (or, alternatively, intensity), though it does not explicitly provide a variance estimate.

For the purposes of this model, let us assume a granularity constant Δr.  We then let D[t] denote the effective span of time it took for the process to move by Δr between [t-1, t].  This is calculated as Δr divided by the change in return (approximately).   For example, if Δr is the return represented by 1 pip, and a 3 pip return is seen over a period of 1 second, D[t] would be 1/3 of a second.

In terms of intensity we can model this as:

We would like to get variance into a form where it can be estimated in terms of intensity.   Given that variance is expressed as the expectation of squared returns, we can formulate this probabilistically:

We can use the Poisson pdf to model p(k jumps | λ), and evaluate the discrete integral up to some maximum jump size m.
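A sketch of that discrete evaluation, treating the interval's return as k moves of size Δr with k Poisson-distributed (the λ, Δr, and truncation m values are illustrative):

```python
import math

def expected_squared_return(lam, dr, m=100):
    """E[r^2] under a Poisson count model: with k moves of size dr in
    the interval (k ~ Poisson(lam)), the return is k * dr, so
    E[r^2] = sum_k (k * dr)^2 * P(k | lam), truncated at k = m."""
    return sum((k * dr) ** 2 * math.exp(-lam) * lam ** k / math.factorial(k)
               for k in range(m + 1))

v = expected_squared_return(lam=4.0, dr=1e-4)
```

Since E[k^2] = λ + λ^2 for a Poisson variable, the closed form is Δr^2 (λ + λ^2), which the truncated sum approaches.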

Handling Jumps
Empirically, it would seem that jumps follow a different process from non-jump price movements.   To address this I’ve added a separate process for jumps, with a single factor guiding exponential decay.   The two processes are combined when the expectation is calculated:

The messy bit is deciding, given a return, whether to consider it a jump J or a standard return D (both as durations).   We can decide that:

D = max (price duration for [t-1,t], jump threshold)
J = max (price duration for [t-1,t] – D, 0)

or decide to categorize the whole return as a jump or normal return.

Seasonality
We might also want to adjust the constant to different levels depending on the prevalent level for a given period of the trading day.   Durations will be short (variance high) at the opens and closes, for instance.   This would require building a seasonality curve for the trading day.

In the FX markets, different trading sessions offer varying degrees of liquidity.   With reduced liquidity we can see periods with very little price action and large (often temporary) jumps in price.  The autocorrelation of intensities is likely to differ in these sessions, not to mention the base intensity.   A Markov switching model may be the way to go to specialize for the different market conditions.

Results
I am now implementing this and will see how it compares to the modified GARCH process.   I may also look at some variants for the jump process.

Filed under statistics, strategies, volatility

## Regression Approaches

One of the readers of this blog (skauf) suggested looking into KRLS (Kernel Recursive Least Squares), which is an “online” algorithm for multivariate regression closely related to the SVM and Gaussian Process approaches.    What struck me as clever in the KRLS algorithm is that it incorporates an online sparsification approach.  Why this is important, I will attempt to explain briefly in the next section.

The sparsification approach is similar in effect to PCA in that it reduces dimension and determines the most influential aspects of your system.  I had been thinking about a combined PCA / basis decomposition approach for some time, and it struck me that KRLS might be the way to do it.

Kernel-based Learning Algorithms
KRLS and SVM algorithms use a common approach to classification (or regression).   In the simplest case, let us assume that we want to find some function f(x) that will classify n-dimensional vectors as belonging to one of 2 sets (apples or oranges), which we will distinguish by + and – values from f(x):

Assuming the vectors X are linearly separable, we could try to find a hyperplane on the n-dimensional space that would separate the vectors such that there is a balance of distance between the two categories (and one that is maximal subject to some constraints).   The distance between the hyperplane (or in 2 dimensions a line) and the points is determined by a margin function.

Optimization is accomplished by maximizing the margins subject to constraints.  The optimization solves for the weights α  such that the sum of the weighted dot product of  X with each Xi yields the predicted value.  The form of f(x) is the following:

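In the standard dual form, with labels y_i in {−1, +1} and bias b:

```latex
f(x) \;=\; \operatorname{sign}\!\left( \sum_{i=1}^{N} \alpha_i\, y_i\, \langle x, x_i \rangle \;+\; b \right)
```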
Now the above graph shows a very well-behaved data set, with a clear boundary for classification.  Many data sets, however, will have significant noise in the data and/or outliers that refuse to be categorized correctly.   This data has an overall impact on the regression line.

Suppose we have N samples in our training set {{X1,Y1}, {X2,Y2}, …, {XN,YN}}.   In simple terms, sparsification is the process of removing (or discounting) samples that are already represented in the existing set, reducing the bias of the regressor.   One approach to determining the degree of representation is to observe the “degree of orthogonality” of a given vector with respect to the existing set.   Sparsification also points to an approach to evolving the regressor over time relative to new observations.

The Kernel
The above gave a brief overview of how SVM uses the notion of margin to classify data.  A kernel is not necessary for data that is linearly separable.   The kernel is “simply” a function that maps data from the “attribute” space to the “feature” space.   We design or choose the kernel function so that our data is largely linearly separable in the “feature” space, with the constraint that the covariance matrix of all possible vectors mapped from attribute to feature space is positive semi-definite.

Since our linear SVM equations are expressed in terms of inner products, given a feature mapping function Φ(x) mapping X from a non-linear space to a “linearly separable space”, we can express the kernel as a function of the inner product of two vectors, later to plug into our linear equations:

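That is, the kernel is evaluated directly in attribute space while implicitly computing the feature-space inner product:

```latex
k(x, x') \;=\; \langle \Phi(x), \Phi(x') \rangle
```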
The trick is to choose a kernel function that maximizes dispersion of the data into sets that are linearly separable.

How does this relate to an explicit probability based regression?
It turns out that the process of optimizing the margin function on a Gaussian kernel  is equivalent to finding the unnormalized maximum likelihood, whereas the Gaussian Process approach makes this explicit.

Which model does one choose?
The Gaussian Process approach is strictly Bayesian.   It has the upside of providing explicit probabilities / confidence measures on predictions.  If one knows the likelihoods of the desired labels with respect to the attribute vectors, the model works very well and provides more information than the SVM family.

The SVM family of approaches use a margin function to determine the similarity between vectors (for the purpose of classification).   This does not explicitly involve probabilities, but for the Gaussian kernel can be shown to be equivalent to the Gaussian Process approach.   The SVM family has the upside that one does not need to know the likelihoods.

Finally, both models allow the use of kernels to map data from a non-normal or non-linear space to a linear or gaussian space.   Some have shown that there is a degree of equivalence between the likelihood function in the Gaussian Process algorithm and the kernel function used in SVM.

Filed under AI, machine-learning, statistics

## Fly Eyes & The Future of AI

Wired had an interesting article about the successful reverse-engineering of the optic system that allows flies to detect motion patterns.   Apparently they don’t fully understand how it works, but have codified it as a system of non-linear equations mimicking the “neural network” of synapses in the fly’s optic center.   I have not been able to find the paper on the web, though I have found the abstract.

True AI
I have often thought that the first successes in creating an “intelligent” AI will be accomplished by mapping the brain and sensory input structure of an existing organism, as opposed to building a new model of knowledge representation and learning.   The latter is a much harder problem than initially anticipated in the ’50s and ’60s, and I’m afraid we are no closer to the goal.    I believe our ability to read biological neural activity at progressively finer granularity will outpace innovation on the AI modeling front.

Certainly there is a very long way to go before (or if) we are able to observe individual neuron activity in a vast structure (such as a mammalian brain), or alternatively scan the configuration of neurons.   Speculating as a layman, I would guess that the amount of noise from the vast number of neighboring or enclosing neurons in the 3-dimensional structure would prohibit the extraction (isolation) of an individual neuron’s signal.  If this problem is insurmountable, it may leave only intrusive / destructive approaches to scanning neural structures.

Renaissance in Biology
Biological systems are enormously complex.   In physics and chemistry we try to understand the fundamentals that pertain to matter/energy, information, and time, mostly at the microscopic or gross macro-level.   With biological systems we have a multitude of “state machines” at different levels each of which can be understood at some micro-level, but result in emergent behavior that we, as of yet, hardly understand.

We’re very early in the game, but it is exciting to see biological systems being explored within the fields of mathematics, engineering, statistics, physics, etc.   With computing power came a number of innovations in biology and the approach to biological systems.    Protein folding, genome analysis, simulation / modeling of cellular mechanisms, are just a few of many areas that were previously intractable (or at least too complex to fathom without the aid of massive computations).

With new tools, biology has become more interesting to engineers and computer scientists, amongst others.   Here is a very interesting example of how biological sciences are changing:

Filed under AI, biology

I’ve not put all that much focus on trade exit strategies, as I have usually relied on the trading signal to determine when to reverse a position (in addition to various risk-related parameters).   I now have a situation where I need to determine the exit independently of the signal, and thus will need to make an exit decision based on other criteria.

As the saying goes, “cut your losses short and let your profits run”.   As simple as this sounds, it is not necessarily simple to implement.   Any given price path will have “retracements”, i.e. the price will not move in a monotonic fashion.    So how does one set the “trigger” to exit the trade such that it can ride over “retracement” noise towards more profit?

Standard Practice
Standard practice is to use a variety of indicators to determine when to exit a profitable trade:

1. retracement as % of “volatility” (recent trading range)
Using recent trading ranges (a proxy to variance) to scale the amount of retracement can work well in certain conditions.
2. arbitrary limits: max profit, maximum trade time
We may have a view around maximum time or profit opportunity for a strategy, allowing us to avoid a later drawdown to signal our exit.
3. various technical indicators

Some Problems
When trading intra-day and observing the market tick-to-tick, the above rules do not work well.   High frequency data has many short-lived high-amplitude spikes (high relative to typical price movements).  This is particularly evident in periods of the trading day with lower liquidity.

Whereas recent trading ranges usually provide a reasonable view on the current period’s noise level for medium-low frequency data, the same is not generally true for high frequency.

Optimal Approach
An optimal approach to the problem is to use a price path model with which we can determine the probability of a retracement swinging back in the direction of profit; or, for that matter, the probability of subsequent price action retracing prior to any significant retracement.    Such a model cannot possibly be right all of the time, but done successfully it will have a significant edge over even odds.

The price path model provides the probability of the price going through a level at time t.  Given that, we can ask:

• what is the probability of the price going through a level within a time period
• what is the probability of the price remaining within a corridor within a time period

The above allow us to make optimal exit decisions based on risk considerations (the corridor) and likely (or unlikely) movement in the positive direction.
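As an illustration of the first question only: under a driftless Brownian model of the price (an assumption far simpler than the price path model discussed), the touch probability has a closed form via the reflection principle:

```python
import math

def prob_touch(level, sigma, T):
    """P(max of a driftless Brownian motion over [0, T] >= level), level > 0.
    By the reflection principle this equals 2 * P(W_T >= level)
    = 2 * (1 - Phi(level / (sigma * sqrt(T))))."""
    z = level / (sigma * math.sqrt(T))
    phi_z = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return 2.0 * (1.0 - phi_z)

# probability of touching a 10-pip level within 4 time units (illustrative)
p = prob_touch(level=0.001, sigma=0.0005, T=4.0)
```

The corridor question is harder (a double-barrier problem), which is part of why a richer price path model is attractive.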

Embedded in such a model, though, is an accurate view on the price process within a near window, a view and a strategy in its own right.

I have developed a general calibration and prediction framework for the price path model; however, the price process SDE needs more work (although it shows good predictive behavior in certain market conditions, it does not yet handle all conditions well).   Are there better alternatives for the short term?

Mean-Reversion Collar
Short of a semi-predictive probabilistic model, the next best thing might be to make use of our recently developed mean-reversion envelope.   The envelope can be tuned to various cycle periods and amplitudes.

Noise manifests as a mean-reverting process around some evolving mean.   We can tune the envelope to encompass the level of noise we wish to ignore.   The projected mean therefore indicates the overall direction of price on average.    We can then use this to determine whether to carry the trade forward through a retracement or not.