Monthly Archives: January 2010

The ideal Quant Environment

R is a wonderful tool in that one can prototype and test ideas very quickly.   If your not using R and doing most of your work in C++ or other low-level language, your missing a lot.   The speed of development between a C++ / Java / C# versus R is 10-50x.

R definitely has its warts as a language and environment.  I not a huge fan of its matrices, index operators, or lazy expression parsing.   More crippling is the fact that R is slow and has memory issues for large data sets.    I would estimate that R is 100x slower than java or C++, depending on what you are doing.

My current environment is a combination of R and a number of lower level languages.   For much of my post-exploration work I feel compelled to write in a lower level language due to the performance issues with R.   My preference is to be using a functional language, as they are generally very concise and elegant.

Ideal Environment
Here is what an ideal environment for me would be:

  1. breadth and depth of R
  2. clean functional language design
  3. concise operations (as close to the math as possible)
  4. excellent rendering facilities
  5. real-time performance
  6. ability to work with large data-sets (memory efficient)
  7. concurrency (I do a lot of parallel evaluation)

Candidates
Here are candidate environments that I’ve used or explored:

  1. python
    cleaner, generally faster than R, very little in the way of statistics and poor integration with R.   No real concurrency as interpreter locked.
  2. Ocaml
    Beautifully concise language, INRIA implementation does not support concurrency
  3. F# variant of Ocaml
    Solves Ocaml’s problems but bound to MS platform
  4. Scala
    Excellent performance, a bit bleeding-edge, much more complex than Ocaml, but on the JVM.

The Special Blend
It is impractical to consider reimplementing even a subset of R into another language environment.   A hybrid approach makes sense.   With python you have Rpy and with Java  JRI.   Neither of these have first class interaction with the language though.

I would like to have the power of a production functional language, but with the same development ergonomics and interaction that R has.    It occurred to me that I could do the following:

  1. dump function templates of all R functions in one’s environment into a Scala module
  2. create implicit conversions from R fundamental types into Scala objects (matrices, vectors, data frames, etc)
  3. create specialized mappings between my Scala-side timeseries and R zoo / ts objects
  4. create some wrappings for unusual usage patterns such as the ggplot operators

Scala has an interactive mode where functions, classes, etc. can be created on the fly.   Because we’ve dumped proxy function for each R function, we also have near first class access to R functions.   There would be some differences in that would not be able to replicate the lazy evaluation aspect of some of the R functions.   Functions that use expression() would have to be wrapped specially.

Because scala allows a lot of syntactic magic, the environment would look very much like R and have complete access to R, but with the huge upside of the more powerful Scala.    One can write code that is “production ready” from the get-go and/or do very compute intensive operations otherwise prohibitive in R.

Now I need to find the time to put this together …

Addendum
I have not decided on which language / environment to base this on.   Scala’s main sell is that it is on the JVM.   The other functional language contender on the JVM with above-scripting-level performance is Clojure.

Clojure is basically a dialect of Lisp and there is already a project  called Incanter that provides a statistical environment within Clojure.    It looks interesting, if early.   Clojure does not yet have a performance profile that is close enough to the metal.   I would expect to see improvements over time though, but due to Lisp’s lack of static typing or type inference, I am doubtful that will see Clojure at the level of statically typed languages.

Since writing this post and having some conversations, I’ve begun to think that F# may be the best choice.   My language preference has been to use Ocaml, but the INRIA Ocaml implementation is handicapped.   F# is closely related to Ocaml and therefore may be a fit.

F# on the MS .NET platform has been shown to be as performant as C#.    From benchmarking C# a couple years ago, was clear that the CLR is pretty close to the JVM in terms of performance.   Given my cross platform constraints, the question has been how viable is F# with Mono from a performance point of view?

It seems like the Mono performance  is being addressed.    The mono LLVM experiment improved mono benchmarks significantly.   I have not been able to test Mono with this extension.   Will have to experiment.

To wed this to R would require writing the equivalent of JRI for .NET / R.

32 Comments

Filed under strategies

Cointegration Clusters

Previously I had written a simple clustering algorithm using correlations to look at rough “relationships” between equities, whether real or “accidental”.    I had evaluated this on the S&P 500.

I decided to do a much wider evaluation on all US equities with average volume > 200K and market cap > 500M + the ETFs.   This subset of equities results in about ~2800 assets or 3,780,000 pairs to be evaluated.    Evaluating this in R was impractical, so evaluated in Java.    The evaluation time was in seconds rather than the days R might have required.

To discover stronger relationships evaluated both the correlation and Augmented Dickey-Fuller test on all pairs,  keeping pairs with adf p-value < 0.05.   Additionally did cross-validation against prior years, throwing away pairs that did not fit in the cross-validation period.

I then used my clustering algorithm (outlined in a previous post) to determine networks of maximally related assets.

This resulted in ~1000 pairs (of the original ~4 million pairs).    Some portion of these pairs “make sense” and others are complete surprises.  That said, given the large number of series it would not be a surprise to find “unrelated” assets that are relatively cointegrated over the test period.

It generated ~50 clusters, here is one at random:

Trading The Pairs
T
he beauty of cointegrated series is that they are much easier to model then series with heteroscedasticity or trending mean.    There are a number of approaches to trading pairs (or larger cointegrated portfolios).    Before getting to this, first want to illustrate a non-cointegrating spread:

We can see that the spread between NI and ACG has a spread that is growing with time from the axis.  The ideal spread is one that oscillates around a constant mean (generally 0).   The above can be traded, but would involve, first a view that the there is a long run beta differential of ~0.25 and weighing the basket appropriately.

Many look at the relative “beta” (i.e. the slope of the long run cumulative returns) for each asset and determine weights based on a linear regression.    That approach works well if trends follow a near linear path over the observation period.

A better approach is to find the weights such that the spread “spends as much time above the origin as below the origin” (ok, it’s a rough heuristic I came up with).   This can be expressed as:

Basically the above is “saying”: find the integral of some weighting of the spread function such that the area is as close as possible to 0 (i.e. we have balanced sweep above and below the origin).   The constraints make sure that the weights don’t go to 0.

If it is Cointegrated
In theory this is a much simpler scenario where we can chose equal and offset weights (-1, 1) and then analyse the resulting spread for entry and exit (technically one may still adjust the weights to adjust for drift depending on MR period one is focusing on).

Next, we want to look for mean-reversion patterns or at least identify levels likely to mean revert conditioned on the past.  Here is a pair that is cointegrated for this period:

The typical approach is to normalize the spread to standard deviations and enter reverting trade when 2 SD or another suitable threshold is realized.   Some basic observations of momentum and vol can be used to decide precisely when to enter.

Another approach used is to calibrate some descendent of the Ornstein-Uhlbleck MR model to the desired level of MR and use as a driver for entry.     I’m not trading pairs at this time, so I’m not sure whether it is worth adopting a MR model.   From past experience with these models, they are hard to calibrate and require significant modification to match empirical behavior, even loosely.

Beyond Pairs
We use pairs to provide a more desirable process statistically, more amenable to MR analysis.   There is a much wider universe of possibilities present in “spread baskets”.    By “spread baskets” am refering to collections of more than 2 assets that are fractionally long or short, producing a tightly cointegrated return.

Determining such baskets is very complex for a number of reasons:

  1. size of search grows at roughly O(N^k), where k is the size of basket and N the number of assets
  2. one needs to determine optimal weights (expensive NLP)
  3. optimal weights need to be tested in cross-validation

Mitigating the worst case is:

  1. can throw away assets with low correlations

To give an example, if we consider the 3-asset case on 2000 stocks, the worst case search would involve 2.6 billion combinations to check.    The correlation matrix may well make this viable however.


4 Comments

Filed under strategies

Poor Man’s Knot Reduction

There are many great implementations of unit-root tests in packages such as R, SAS, etc.  For various reasons I need to implement this in Java.   Probably the best known test of this sort is the Augmented Dickey Fuller test.   This is most often used in determining whether a process is non-stationary or rejecting in favor of stationarity.

Loosely speaking, to say a process is stationary implies that all of its moments are largely time invariant.   So for example the mean and variance of a stationary process would be (largely) constant across time.

We know that AR(1) processes with AR coefficient α = 1 have unbounded variance that grows with time.  This can be seen by evaluating its variance:

The ADF test therefore tests the hypothesis that a given series expressed as a AR(p) process has a coefficient of 1 (ie unit root).   The ADF expands upon the original Dickey Fuller test to allow for additional AR lags:

The null hypothesis is that β = 0 (non-stationary process) and alternative hypothesis (may be stationary) otherwise.   Note that  β = 0 in this differenced equation is equivalent to α = 1 in the non-differenced AR(1) equation.

Samples (i.e. the difference between the observed and null hypothesis value) in hypothesis testing often follow a student-t distribution.   In the case of the Dickey Fuller test the distribution is not student-t and therefore presents a problem.  To calculate the p-value for the ADF hypothesis, one must refer to special Dickey-Fuller tables, as there is no known closed form method to calculate this (that I am aware of).

The p-value, we know to be the probability of obtaining a result at least as extreme (far away from the hypothesis value) as the one obtained in our sample.  Basically this involves calculating:

Then to determine the p-value for the ADF, we need to determine the distribution of values around hypothesis β = 0.  The tables provide precalculated mappings between test values and the cumulative probability.

Problem
The Dickey-Fuller tables are sparse, particularly at the tails.   For my problem I was interested in the relative “stationarity” of many different series.   For this, I the needed to calculate the Dickey-Fuller pdf on my own with a Monte-Carlo simulation.

Here is one configuration of the pdf for the Dickey-Fuller hypothesis (produced from a monte-carlo simulation):

Of course monte-carlo simulations are not practical on each evaluation of the ADF test.   The results from the MC simulation generate a high resolution density function as a series of (x,y) points.   I could generate distributions for a number of configurations and store a very large 3D table of function values, but is impractical and unnecessary.

I started thinking about an approximating function for the distribution.   Aside from finding functional form for the distribution through trial and error, a general automated approach is to use splines.   A natural spline will fit the function and be piecewise continuous.

Splines, however, do not reduce the dimensionality of the problem.  A natural spline requires the complete set of knot points to reconstitute the function.   I then started thinking about knot reduction techniques.   With knot reduction the idea is: what is the minimum set of knots required to reproduce the target function within some (small) error range.   So if my function currently has 2000 (x,y) points, perhaps I could reduce it to ~10 points?

There have been many papers written on this topic, often involving very complex algorithms.   As I wanted to keep this simple, decided to come up with a “poor man’s algorithm” for knot reduction.

Algorithm
The algorithm starts with a spline consisting of the 2 end-points  and gradually adding knots at specific positions, iteratively reducing the error between the “reduced spline” and the target function.   The new knots are chosen based on the maximum distance from the target function.

The algorithm is not guaranteed to be optimal in that it may have more knot points than the smallest possible reproducing set.  However, it is very fast and is often close to optimal, depending on the function.   The algorithm is as follows:

  1. start with a spline consisting of 2 points (the start and end point)
  2. loop while R^2 < (1 – eps)
    1. add knot at point with maximum distance from target function
    2. compute new spline and R^2 fit metric
  3. at end of loop, set of knots now matches function within epsilon

The effectiveness of the approach can be enhanced by using a better basis function than the hermite cubic and/or by adjusting tensions in addition to knots.

Here is an example of the function to epsilon = 1e-8.  I’ve slowed this down to 1 iteration per second so you can see how it works.   In practice this involves a few micro-seconds:

9 Comments

Filed under strategies

Theory of Computation, Quantum Computing, etc

I’d like to point to a very interesting set of lectures by Scott Aaronson entitled “Quantum Computing Since Democritus”, here. The lectures have a lot of thought provoking content in addition to being entertaining — a style reminiscent of Richard Feynman.

Leave a Comment

Filed under strategies

Algorithmic Differentiation

In the previous post I outlined a portfolio utility function with a penalty for risk:

One of the components of this function, E[r], is specified as a weighted geometric average.   By itself I could transform to use a log sum but does not simplify with the added risk term:

I need to construct a gradient and hessian for the above function:

It is not that the 2nd partial derivative is intractable, indeed it just requires the application of the chain rule.  Unfortunately it expands to an enormously complicated expression that grows quadratically with the number of variables.

Computing the derivative via finite differencing works for some applications, but presents problems where the “perturbations” need to be small.   The 2nd derivative using a central weighing scheme:

One can see that the differenced approach is susceptible underflow.   Choosing a value of h < 1e-7 will produce a value which exceeds the precision of a double (consider the denominator).   Once the precision of double is exceeded, results will be meaningless.

Algorithmic Approach
I began to think of a recursive approach to evaluating the nth partial derivative, as I had trouble coming up with a pattern that fit into a single expression for generalized number of periods and dimension.   It turns out that many others have thought along these lines.   This approach is called “Algorithmic Differentiation”.

The above equation can be simplified with respect to just 2 of its N variables for the purposes of calculating 1 coordinate in the hessian as:

We understand how to differentiate the above by means of the chain rule, which states that compositional functions, such as f(g(x)) can be differentiated as f’(g(x)) g’(x).   This implies a series of basic rules:

We can decompose an expression into an (implicit or explicit) expression tree and use the rules for derivatives to recombine to an ultimate result.   For instance, here is a tree for f(x,y) = x y sin(x^2).   (the graph for my function would be much too large to render here):

There are a number of libraries (for C++ and Java) which allow one to express a function in normal-looking code, but implicitly calculate the first or second derivatives at the same time.   A “dual” representing f(x) and f’(x) is tracked through the evaluation of the expression.

Unfortunately, the size of the expression (tree) for my utility function is enormous, particularly for the 2nd derivative.   I will avoid computing the hessian and use the BFGS minimizer for part of my optimization, as will end up being cheaper than evaluating the hessian.

Leave a Comment

Filed under strategies

Strategy Allocation

I’m interested in allocating capital across a series of strategies or assets in a way that balances between maximizing return and minimizing drawdown.   I’ve not spent much time thinking about this previously, so this is a work in progress.

Mean-Variance Approach
It has long been traditional to use the mean-variance approach for portfolio allocation.    A problem with the mean-variance approach is that it penalizes excess returns on the “right” (positive) side of the distribution in addition to the “left” (negative).   This is one reason why many consider the Sharpe ratio to be a flawed measure with respect to how investors see risk.

The mean-variance approach penalizes for sections D and C of the distribution.   An investor is generally happy with excess returns (i.e. D), and unhappy with drawdowns in the distribution (namely areas A and C).  We would prefer to accept the reward of the whole right side of the distribution and attempt to minimize the left.

Lower-Moment Frameworks
To remedy this we can use an asymmetric measure of variance to focus on the part of the distribution that represents our risk.   The family of such moment functions are called the “Lower Moments”.    Lower moment functions allow us to focus on sections A and C in the distribution (ie the drawdowns) and not consider the right side as a contribution towards risk.

Lower moment functions are expressed as follows:

The idea, simply, is that the moment (degree m) is computed on the region of the distribution below some threshold (say 0).

For a risk measure, we may want to penalize returns < 0 or prehaps returns < risk free rate.   We would choose τ according to where we want to penalize.

Application to portfolio
Ok, so the lower-moment looks like a good measure of downside risk.  How do we incorporate it?  Let’s set up the problem:

  1. let R represent a matrix of returns
    One row for each period, number of columns = number of assets or strategies.
  2. let w represent the vector of weights for each asset / strategy
  3. let Σd represent the lower-moment covariance of returns with respect to the set of assets or strategies

Assumptions:

  1. let us assume that past performance is indicative of future performance  (we can relax this later)
  2. let us assume an eliptical (i.e. multivariate normal) distribution for negative  returns
  3. let us assume that the drift grows in proportion to variance x time

Ok, those are pretty aggressive assumptions, but they allow a first look at putting this together.

We need to create a utility function that blends the upside and downside, subject to various constraints.  Here f(w) represents the portion of a convex utility function rewarding positive return, and g(w) a function describing the risk penalty.

Let’s formulate the utility as the expected return for the next period (given the prior returns) and the expected negative return component (our risk) over the period:

The first part of our utility E[r] represents the average observed return and the second part the negative drift (our downside risk).    E[r] can be computed in any number of ways, for instance:

  • as a time weighted average of past returns
  • as a regression model with seasonality
  • from a stochastic model calibrated to past returns

The risk penalty is just ½ E[(τ - <w,r>)^2 | r < τ) · Δt, which approximates the expected downside drift over the period represented by Δt.

Now the above is a bit ad-hoc and I may be double counting the upside and downside with E[r].   I’ll have to think on the implications for mixing a lower moment measure and a full moment measure.

There may be some adjustments to make, but have outlined a first pass at penalizing returns and hopefully arriving at a more balanced weight selection.

Addendum
A more generalized approach could be to compose the utility function from a parameterized sum of upper and lower moments (Farinelli and Tibiletti 2002).   I’ve adjusted this a bit to fit my problem:

The number of moments for the lower portion can be different for the upper portion (or simply let the coefficients of some moments be 0).    What I came up with in the prior section is essentially this function for 2 moments with coefficients {1,0} and {0,1/2}.

Addendum 2
The utility function should have been (drift in proportion to sqrt), my bad:

Leave a Comment

Filed under portfolio, statistics, strategies

Approximating |x|

I am changing my portfolio strategy to use an interior point optimizer for non-linear problems with constraints.  The approach solves nonlinear problems of the form:

where f(x) is the primary function to minimize and g(x) represents one or more constraints (equalities or inequalities).   There are a number of good papers on interior point methods for solving such problems so will not go into the method here.

In my case one of the constraints is as follows:

As simple as this looks it presents a problem because the interior points algorithm requires that the constraints be twice differentiable.   |x| is continuous but not differentiable at 0.   In formulating the solution I need to construct the gradient (jacobian) and hessian matrices of the constraints.

I began to think of other ways to describe the constraints.  Basically I want to allow my portfolio to allocate both long and short positions between trading periods.  The above constraint treats long and short capital as symmetrical (which suits my purposes).   I.e., if I have just two assets in a portfolio and allocate -0.75% short to one asset and 0.25% long to the other asset, I consider that to be fully allocated (from a risk capital point of view).

I did a few thought experiments, considered variations of:

But this is clearly wrong, in that will allow for allocation > 100%.   I began to think about the derivative function of |x|.   See (in blue |x| and red the derivative):

It occurred to me that the derivative can be asymptotically approximated by the sigmoid function with a shift and accelerated exponent:

By making β large enough we can approach the derivative of |x| to arbitrary resolution.

Unfortunately the integral of the sigmoid function has no solution in R, and only has a solution on the complex plane, meaning that if we are to deploy this approach will have to evaluate g(x) with Σ|x| and the derivatives with the scaled and translated sigmoid function.

Unfortunately, a hybrid approach with |x| for constraint evaluation and sigmoid for the derivatives will lead to instabilities.   One needs to have an approximation to the derivative that has an integral.

Addendum
It turns out this is a common problem in machine learning and optimisation.   Chen and Mangasarian defined an approximation to the sigmoid integral.    The problem can be reformulated as follows.   Break the problem into two pieces, x ≥ 0 and  x ≤ 0.    If we can solve it for x ≥ 0 then we also have a solution for x ≤ 0:

Based on an approximation to the sigmoid integral, they determined (x)+ to be:

As with the sigmoid function, increasing β allows us to get asymptotically close to |x|.   We now have a function for |x| with derivatives defined around 0.

3 Comments

Filed under strategies