Bitcoin, why I am still interested

The Not-so-Great News

Bitcoin has been in a (mostly) negative trend since early 2014, falling from its high (~$1175) to, as of today, a low of $165.  It has also been associated with a steady stream of negative news across the year: bitcoin theft, shoddy exchanges, illicit uses, etc.

Bitcoin has been a victim of its own “success” in terms of asset valuation.  The ascent to the >$1000 price was largely built in the Nov – Dec ’13 period, and naturally the coin is reverting to sustainable levels.

Unfortunately, the ascent to $1000+ was associated with a period where two algos were aggressively buying ~600,000 BTC from Mt Gox.   It is believed (though not confirmed) that these algos may have been used to perpetrate the fraud that stole a similar amount of bitcoin from Mt Gox (and its user base).   In other words, the algos may, through aggressive buying, have set up the momentum that took the price from sub-$200 to over $1000.  The whole movement may have been ignited and sustained by a fraud.  See this analysis.

Bitcoin now is in the perfect storm of:

  1. likely slow liquidation of stolen bitcoin (how much is left is hard to know)
    1. Evidence suggests the above algos started to liquidate in early 2014, but they could not have liquidated the complete holdings (unless through another agent).
  2. liquidation of mined bitcoin, as miners are desperate to recover what they can at a fallen valuation
    1. A significant amount of mining infrastructure was invested in on the basis of inflated BTC valuations.   With adjustments to hash difficulty this may normalize, but for now it should pressure miners to liquidate.
  3. financial markets in disarray, triggered by falling oil -> commodities -> EMG, equities, etc.
    1. Expect this to leak into the BTC market as well.
  4. more news of stolen bitcoin

I believe Bitcoin and cryptocurrencies in general will recover, however.  The negative impact of all of the above makes it very likely that BTC will sink further before it recovers (so far today it hit a low of $165).

The shake-out from this has given rise, and will increasingly give rise, to more secure financial technology, as opposed to the quick-and-dirty implementations (such as Mt Gox’s PHP travesty).  That said, developing carefully and with appropriate safeguards is not easy, especially in the context of a startup racing to get a product out the door.

So why am I (still) interested?

Bitcoin is one of the most transparent marketplaces (if not the most transparent).  For example:

  1. trades are labeled in terms of which side aggressed (buy or sell)
    1. buy/sell imbalance
    2. orderbook resiliency
  2. orderbook information is very useful (not as obfuscated by the gaming seen in other markets)
    1. OB slope & resiliency
    2. OB prediction modeling
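For instance, because each trade is labeled with the aggressing side, a buy/sell imbalance measure is trivial to compute.  A minimal sketch over hypothetical trade tuples (the tuple layout here is my own illustration, not any exchange's format):

```python
# Each trade: (size, side) with side +1 for an aggressive buy, -1 for an aggressive sell.
def flow_imbalance(trades):
    """Signed volume imbalance in [-1, 1]: +1 all buy-aggressed, -1 all sell-aggressed."""
    signed = sum(size * side for size, side in trades)
    total = sum(size for size, _ in trades)
    return signed / total if total > 0 else 0.0

trades = [(1.5, +1), (0.5, -1), (2.0, +1)]
# (1.5 - 0.5 + 2.0) / 4.0 = 0.75
imb = flow_imbalance(trades)
```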

Where it lacks:

  1. low liquidity compared to more traditional assets (but enough for smaller operations)
  2. the futures / forwards market has very low liquidity
  3. shorting can be difficult

In short, like many new markets & asset classes, there is often wider opportunity for earlier participants.   Markets tend to tighten as they mature.


Filed under strategies

Causality in observational data

In the past I had created clusters on assets to help identify relationships across assets.   The resulting graphs were useful in identifying assets for use in mean-reverting portfolios.   A difficulty in the approach was always around accurately measuring the strength of relationships between asset pairs.   I had looked at Granger causality, which is fairly limited in that it expects asset relationships to follow a VECM-like model, and eventually settled on weighting across a number of techniques as an approximation.

Determining causality for more general relationships requires a very different approach, where Y ← f(X) + ε.   i.e. f(x) may be an (unknown) non-linear function on X.   I came across an interesting paper: “Distinguishing cause from effect using observational data: methods and benchmarks” (link) which builds on work around looking at the asymmetry of the “complexity” of p(Y|X) versus p(X|Y), where if  Y ← X (X causes Y), p(Y|X) will tend to have lower complexity than p(X|Y).

The paper provides results on two methods: Additive Noise Methods (ANM) and Information Geometric Causal Inference (IGCI), where ANM generally did better across a variety of scenarios than IGCI.

The algorithm for ANM in python-like pseudo-code (taken from the paper):

def causality(xy: DataFrame, complexity: FunctionType, complexity_threshold: float):
    n = nrows(xy)
    xy = normalize (xy)
    training = sample(xy, n/2)
    testing = sample(xy, n/2)

    ## regress Y <- X and X <- Y on training set
    thetaY = GaussianProcess.regress (training.y, training.x)
    thetaX = GaussianProcess.regress (training.x, training.y)

    ## predict Y' <- X' and X' <- Y' & compute residuals on testing set
    eY = testing.y - GaussianProcess.predict (testing.x, thetaY)
    eX = testing.x - GaussianProcess.predict (testing.y, thetaX)

    ## determine complexity of each proposition on variable and residuals
    Cxy = complexity (testing.x, eY)
    Cyx = complexity (testing.y, eX)
    ## determine whether X -> Y, Y -> X, or indeterminate
    if (Cyx - Cxy) > complexity_threshold:
        return X_causes_Y
    elif (Cxy - Cyx) > complexity_threshold:
        return Y_causes_X
    else:
        return None

The paper recommended using a scoring of the Hilbert-Schmidt Independence Criterion (HSIC) as the complexity test. I will have to code the above up and see how it does on a broad set of assets. The paper did test on the CEP benchmark, which includes some known financial relationships.
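As an illustration, a biased empirical HSIC estimator with Gaussian kernels might look like the following (the median-distance bandwidth is a common default and an assumption on my part, not the paper's prescription):

```python
import numpy as np

def _gram(v, sigma):
    """Gaussian kernel Gram matrix for a 1-d sample."""
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def _median_bandwidth(v):
    """Median pairwise distance: a common kernel bandwidth heuristic."""
    d = np.abs(v[:, None] - v[None, :])
    m = np.median(d[d > 0])
    return m if m > 0 else 1.0

def hsic(x, y):
    """Biased empirical HSIC between two 1-d samples (near 0 iff independent)."""
    n = len(x)
    K = _gram(x, _median_bandwidth(x))
    L = _gram(y, _median_bandwidth(y))
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=500)
indep = rng.normal(size=500)                 # independent of x
dep = x ** 2 + 0.1 * rng.normal(size=500)    # strongly dependent on x
# hsic(x, dep) should come out substantially larger than hsic(x, indep)
```

Used as the `complexity` argument above, the direction whose residuals are more independent of the input (lower HSIC) is preferred.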


I am quite busy now with another investigation, but will revisit this with a proper implementation.


Filed under strategies

Thompson Sampling

I recently attended a talk by David Simchi-Levi of MIT, where he discussed an approach to online price discovery for inventory, to maximize some objective such as profit.   The scenario was one where we could observe whether inventory was sold or not sold for each potential buyer in the marketplace, giving an empirical view of demand behavior as a function of price.   The optimal setting in selling the inventory is one that maximizes price × liquidation probability.

When we have no knowledge about the true demand for inventory as a function of price, we must start with some prior estimate in the form of a sold/unsold distribution (the demand function in terms of price) and modify this online during the selling period.   The setup of the problem is then:

  • we have a fixed period in which to sell the inventory
  • we can observe whether a prospective buyer bought at the offer price or rejected it as too high
  • we determine a number of different selling price levels and some prior on the probability of sale at each price, starting with a uniform distribution

This can be modeled as a multi-armed bandit problem, where we decide amongst n bandits (slot machines), determining which slot machine gives the highest payout through iteratively adjusting our model based on observations.

In determining the optimal price, we can formulate this as a set of bandits: a sequence of prices ranging from low to high.  Associated with each price is a distribution which represents our view on the demand at that price (i.e. the probability of sale).  With no prior, we can start with an equi-probable view on demand across prices, with an initial Beta distribution at each price level of B(α=1, β=1).

Rather than provide the formal setup for Thompson sampling, I will illustrate its use for the above problem.  The sold/unsold distributions and selection of price can then be evolved as follows:

  • take a sample from each distribution associated with a price (representing the probability of sale at each given price)
  • choose the ith price (bandit) that maximizes the objective: p(sale[i]) × price[i], where p(sale) is the sampled probability from each demand/price distribution
  • offer the ith price to the marketplace
  • observe whether the unit of inventory is sold or unsold
  • adjust the Beta distribution associated with price[i] (the price used in the offer) as follows:
    • if sold:  B′ = B(α+1, β)
    • if unsold: B′ = B(α, β+1)

The net effect over time is that the distributions will converge such that the optimal bandit (associated with optimal price) will offer the highest probability in sampling or at least the highest expectation (price x p(sold)).  The approach (Thompson Sampling) has many applications in finance for online optimisation.  Definitely going to be part of my toolset going forward.
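The procedure above can be sketched in a short simulation (the price grid and true sale probabilities are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical price grid and (unknown to the seller) true sale probabilities.
prices = np.array([10.0, 15.0, 20.0, 25.0, 30.0])
true_p = np.array([0.8, 0.7, 0.4, 0.2, 0.05])   # expected revenue peaks at 15 * 0.7

# One Beta(1, 1) prior per price level.
alpha = np.ones(len(prices))
beta = np.ones(len(prices))

for _ in range(5000):
    # sample a plausible sale probability from each price's posterior
    p_sale = rng.beta(alpha, beta)
    # choose the price maximizing sampled expected revenue p(sale) * price
    i = int(np.argmax(p_sale * prices))
    # offer it; observe sold / unsold against the true demand curve
    if rng.random() < true_p[i]:
        alpha[i] += 1     # sold:   B' = B(a+1, b)
    else:
        beta[i] += 1      # unsold: B' = B(a, b+1)

# the most-played arm should converge to the revenue-optimal price
best = int(np.argmax(alpha + beta))
```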


Filed under strategies


The T-SNE approach is a very intuitive approach to dimensional reduction and visualization of high-dimensional features.  I implemented this in F# recently and had used some implementations in R and python previously.  I have found it very useful in determining whether I have designed a feature set correctly, such that it has a natural degree of separation with respect to labels.

The algorithm is elegant: determine the pairwise probabilities of points being neighbors in high-dimensional space, then find a distribution in 2-dimensional space that, to the extent possible, reflects the same distances (probabilities), with a bias towards preserving locality.   This is solved as an iterative gradient descent on the pairwise relationships across n points.
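The affinity computation at the heart of this can be sketched as follows (a simplified version with a single fixed bandwidth σ, rather than the per-point bandwidths t-SNE fits by binary search on perplexity):

```python
import numpy as np

def pairwise_affinities(X, sigma=1.0):
    """Symmetrized neighbor probabilities p_ij from pairwise squared distances."""
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # conditional p(j|i) via a Gaussian kernel, excluding self-pairs
    logits = -d2 / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)
    p_cond = np.exp(logits)
    p_cond /= p_cond.sum(axis=1, keepdims=True)
    # symmetrize to a joint distribution summing to 1
    return (p_cond + p_cond.T) / (2 * n)

X = np.random.default_rng(0).normal(size=(50, 10))
P = pairwise_affinities(X)
# P is symmetric, zero on the diagonal, and sums to 1
```

The gradient descent then fits low-dimensional points whose (heavier-tailed) affinities match P; that step is omitted here.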

The problem is that it is an O(n^2) problem, equivalently solving the n-body problem.   My data sets are in the 100s of thousands or millions, so you can imagine, this is no longer feasible on a single processor.

Barnes-Hut is an approximation approach for n-body problems, bringing the complexity to O(n log n), but it can only be effective if one’s partitioning of space is coarse.  For some of my problems, I have found that I cannot get appropriate separation with this sort of approximation.

In terms of parallelizing, n-body problems require fine-grained parallelism to solve efficiently.   Given that fine-grained parallelism is limited in scale (how many processors can you have on one machine or GPU), I have been considering whether there is an iterative hierarchical approach that could be applied with distributed parallelism.  I would not expect such an approach to get anywhere close to linear speedup, but perhaps it can do better than a limited # of local nodes.

Another approach to this problem, which is easily distributed, would be a meta-heuristic approach such as differential evolution.


Filed under strategies

Money Management

It has been almost a year since my last post.  I have been far too busy getting a new trading desk up and running.   I thought to discuss money management, since I am revisiting it right now.


It is easy to think that the trading signal is the most important aspect of a trading strategy, but money management (and execution) can be even more important.   Loosely defined, money management is a mechanism for position-level risk management.  The mechanism attempts to regulate positions, accomplishing a number of things:

  1. ride out a profitable signal as long as there is continued profit potential
    1. close out a profitable position when the p&l is drifting sideways
    2. alternatively, close out a profitable position when there is a drawdown from its high
    3. otherwise, allow the position to continue to profit, even through transient negative noise
  2. close out losing positions as quickly as possible
    1. close the position once we have a view that it is unlikely to be profitable
  3. close out the strategy if it seems unstable
    1. for example, it keeps hitting its stop-loss
    2. risk measures indicate a market situation that is unstable for the strategy

A desirable feature of a money manager is that, pairing the money manager and signal together, we have a return distribution with positive skew and very limited negative tails.   We can even have a signal with < 50% wins, but because of the generated bias in positive versus negative returns, still have an overall positive equity curve.   Of course I would advocate for a much higher win ratio than 50% 😉

Signal → Position

I take the approach of having a trading signal that scales between [-1,1] or [0,1] on a continuous basis.   In my trading systems the money manager not only works as a risk manager, but also decides how to scale the signal into a desired position.

For example, if our maximum position is $5 million, we might scale our desired position from 0 to $5 million (if the signal reaches full power at 1).  A signal at or close to 0 would indicate being out of the market; 0.5 would be half strength, or $2.5 million, in.   Here is an example signal from 0 to 1:

Trading signals can be noisy, though we do our best to provide smooth signals.   Without regulation of how we map the signal to position size, the up and down dips in the signal would imply thrashing in and out of position, which would be costly.

Hence, we should try to enforce directional monotonicity, so as to avoid thrashing.
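As a sketch, one simple monotonicity-enforcing mapping holds the position inside a no-trade band and only follows the signal once it moves decisively (the band widths and maximum position are illustrative assumptions, not my production logic):

```python
def scale_position(signal, prev_position, max_position=5_000_000,
                   entry_band=0.10, exit_band=0.05):
    """Map a [0, 1] signal to a target position, suppressing thrashing.

    The position only moves once the signal-implied target escapes a band
    around the current position; small dips in the signal leave it alone.
    """
    target = signal * max_position
    if target > prev_position + entry_band * max_position:
        return target        # signal clearly strengthening: scale up
    if target < prev_position - exit_band * max_position:
        return target        # signal clearly weakening: scale down
    return prev_position     # inside the band: hold (no thrash)

p1 = scale_position(0.50, 0.0)   # enter at half strength
p2 = scale_position(0.48, p1)    # noisy dip: position unchanged
p3 = scale_position(0.20, p2)    # real decay: scale down
```

A noisy dip from 0.50 to 0.48 leaves the $2.5 million position untouched, while a fall to 0.20 scales it down.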

Types of stop-loss

There are a number of stop-loss types we should consider:

  1. stop-loss:
    1. stop when the (smoothed) equity curve has reached a negative return threshold
  2. stop-profit:
    1. exit an up-to-now profitable trade, but one that has lost some % from its high
  3. stop-drift:
    1. a time- and slope-based stop that closes out a position whose equity curve has been drifting more-or-less sideways for a significant period
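These three stop types might be sketched as predicates over the position's equity curve (all thresholds here are illustrative assumptions):

```python
import numpy as np

def stop_loss(equity, threshold=-0.02):
    """Stop when the (smoothed) return since entry breaches a negative threshold."""
    return equity[-1] - equity[0] < threshold

def stop_profit(equity, giveback=0.5):
    """Exit a so-far-profitable trade that has given back a fraction of its peak gain."""
    peak = np.max(equity) - equity[0]
    return peak > 0 and (equity[-1] - equity[0]) < (1 - giveback) * peak

def stop_drift(equity, window=50, min_slope=1e-4):
    """Close a position whose equity curve has drifted sideways for a while."""
    if len(equity) < window:
        return False
    slope = np.polyfit(np.arange(window), equity[-window:], 1)[0]
    return abs(slope) < min_slope
```

In practice these would run on a smoothed equity curve, per the discussion above, rather than raw p&l.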

Risk Reentry Avoidance

On a stop-loss we not only want to close the position, but also have to “back away” from the signal, such that we do not immediately get back into an undesirable situation.   Depending on why we exited the position, we may want to:

  1. disable market entry until the signal has gone to zero
  2. impose a time penalty
  3. impose a market reentry restriction (wait for the market regime to reach a certain stage)


Here is a finite state machine that illustrates a possible state system guiding position scaling and money management:

The state system expresses some of the above money management features.   To make it effective, one needs to be clever about deciding whether a negative movement is a sign to get out or a transient movement.   One can use a combination of the signal, smoothed series, and aspects of order book and trade data to get a better read on this.
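As a sketch, such a state system might look like the following (the states, thresholds, and transitions here are illustrative assumptions, not the exact machine in the figure):

```python
from enum import Enum, auto

class State(Enum):
    OUT = auto()          # flat, waiting for a signal
    SCALING_IN = auto()   # building a position as the signal strengthens
    IN = auto()           # holding a (near) full position
    STOPPED = auto()      # stopped out; reentry restricted

def step(state, signal, stopped_out, entry=0.25, exit_level=0.05):
    """One transition of a simple money-management state machine."""
    if stopped_out:
        return State.STOPPED
    if state is State.STOPPED:
        # back away: require the signal to decay to ~0 before re-arming entry
        return State.OUT if signal < exit_level else State.STOPPED
    if state is State.OUT:
        return State.SCALING_IN if signal > entry else State.OUT
    if state is State.SCALING_IN:
        return State.IN if signal > 0.9 else State.SCALING_IN
    # state is State.IN
    return State.OUT if signal < exit_level else State.IN
```

The STOPPED state implements the "disable entry until the signal has gone to zero" rule; a time penalty or regime restriction would add further guards on that transition.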


Filed under strategies

Thought Experiment On Randomness

I was discussing pseudo-random number generators with a colleague, around desirable attributes of the distribution, periodicity, etc. — all fun and important stuff.

It reminded me of a chain of thought ranging from complexity theory to the philosophical.   Pardon the armchair physics and philosophy.

As we know from information theory, any sequence generated by an algorithm with no exogenous input, except perhaps an initial state, cannot be random.   There are multiple measures of randomness in theoretical computer science around measuring the degree of pattern in a given sequence (or string).    One that appeals to me is the Kolmogorov complexity measure (I was reminded of this by recent posts on Shtetl-Optimized), which is approximately:

for some string x (say a random sequence we are testing), what is the shortest possible program in some universal computer language K(x) that prints x and halts.

Intuitively, if a sequence is truly random, there is no algorithm that can generate it short of the sequence literal itself.  Assuming we are looking at very long sequences X (our random sequence), the program to produce them must also be very large (as long as the literal) and hence have a very high K(x) measure.  A long pseudo-random sequence may look quite complex at first glance, but can be described by a short program generating the sequence, and hence has a low K(x) measure.

Not being a complexity theorist, I’m playing pretty loose with the definition, but you get the idea.
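K(x) itself is uncomputable, but a crude proxy for the intuition is compressed length — this only illustrates the compressibility idea, not Kolmogorov complexity proper:

```python
import random
import zlib

def compressed_ratio(s: bytes) -> float:
    """Compressed size / original size: a rough upper-bound proxy for complexity."""
    return len(zlib.compress(s, 9)) / len(s)

patterned = b"ab" * 5000        # highly structured: a short program suffices
random.seed(1)
pseudo = bytes(random.randrange(256) for _ in range(10000))   # PRNG output

# The patterned string compresses to a tiny fraction of its length; the
# pseudo-random one barely compresses at all, even though its true K(x) is
# small (seed + generator) -- a general-purpose compressor cannot see that.
```

This mirrors the point below: the PRNG output's apparent complexity is an artifact of our inability to find the generating function.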

Going further, one thing I’ve often thought about is whether randomness truly exists in the universe.   This may be equivalent to asking: is the universe computable?  Questions and theories around this have been proposed and discussed since the 1950s.   Are the random distributions we see in quantum states, for example, just manifestations of a complex computable function, perhaps one with many dimensions and long periodicity?

If the fabric of the universe is computable, then it stands to reason that what is contained within is also.  In this scenario, we, being computable machines made of this macro-stuff called matter, are just very complex functions of our initial states; not just our initial state (our conception and program therein), but the state vector of all influences in the universe, our surroundings, and other automata.

Well you can see where this goes.   Free will is a function of our reason and environment, but our reasoning function is also predetermined by this astronomical state vector / ongoing computation.

Anyway, I was reluctant to post this for a while, but thought it would be fun for a change.  Starts with science and ends with philosophical possibility.

I should add, the data we receive from “random” physical processes appears to be random.  If we assume a computable universe, the generating function for these random processes could be vast, given the possibility of an “astronomical”, but finite, number of inputs (or dimensions).   I.e. our approximation of the K(x) function appears to be large just because we do not know how to determine the generating function.   Hence, within a short lifetime of observation and computation, such a function would have the appearance of randomness (as in maximal complexity) and infinite period.


Filed under strategies


It has been a while since I’ve posted and I apologize for leaving the prior subject mid-thread.   I’ve become extraordinarily busy with a new company, but will revert to research mode and follow up once things settle down.

In the meantime, one of my long-term side interests has been biological systems.  These are interesting because they are probably the most complex systems we can study.   Our understanding of specific cells and interactions has grown enormously within a century, but we’re still only scratching the surface, given crude measurement tools and the mind-boggling complexity.   Computational biology provides a simulation-based approach to studying these problems, and hence is of particular interest to me.

Some “holy grails” that have interested me:

  1. creation of artificial lifeforms to “prove” that life is nothing more than a system (done: Craig Venter’s team)
  2. mapping of neural structures into simulation
  3. creation of alternative chemistries / energy / replicating processes


Here is an article I just found about a researcher in the UK who is having some success with creating structures and processes based on metal salts as an alternative to carbon forms:  Life cells made of metal.   It seems like they have been able to create cell-like structures and may be close to a means of generating energy for the entity via photosynthesis.   Whether this is fruitful or not remains to be seen.



Filed under strategies

Smoothed UKF

The smoothed UKF performs smoothing on the state system a posteriori.   The standard form of this filter is useful in getting an accurate view of prior states, but will not provide a smoothed online state estimate for the current or next period (i.e. no better estimate than the non-smoothed UKF).

That said, the smoothed UKF is still useful, as:

  1. it can be used to estimate the MLE for a stable parameter
  2. a timeseries of smoothed prior states can be regressed to project a smoothed forward state (though this is not part of the UKF framework)

The form of UKF smoother that I will briefly discuss is the forward-backward smoother.   As the name suggests, the first pass is to perform the standard UKF filter, estimating the distribution p(x_t | x_{t-1}, y_{1:t}) at each timestep, whereas the smoothing pass estimates a smoothed distribution in reverse, implying p(x_{t - 1}^S|x_t^S,\Sigma _t^S).   The boundary x_T^S,\Sigma _T^S is simply the last state and covariance estimated by the forward filter.

The smoothing approach is then, for each x_t^S, \Sigma _t^S, to determine the predicted x_{t+1}^-, \Sigma _{t+1}^-, \Sigma _{t,t+1}^-, and back out an appropriate adjustment given a computed Kalman gain and the difference between the predicted and actual t+1 state:

\left[ \mu _{t+1}^-, \Sigma _{t+1}^-, \Sigma _{t,t+1}^- \right] = UT\left( X_t, \Sigma _t, f(\cdot), \cdots \right)

K = \Sigma _{t,t+1}^- \left( \Sigma _{t+1}^- \right)^{-1}
X_t^S = X_t + K\left( X_{t+1}^S - \mu _{t+1}^- \right)
\Sigma _t^S = \Sigma _t + K\left( \Sigma _{t+1}^S - \Sigma _{t+1}^- \right) K^T

A simple example of a non-linear function is the sine function.   Tracking it with a standard Kalman filter via a linearization will perform poorly.  My test case for the UKF was the following function with noise:

y(t) = {a_t}\sin \left( {{\theta _t}} \right) + {\varepsilon _t}

where {a_t} and \theta _t are linear functions of time.  The states in the system track: A(t),\frac{{dA}}{{dt}},\theta (t),\frac{{d\theta }}{{dt}}.   Here is UKF tracking without smoothing:

And here is the same with smoothing:

I’m enclosing R code for the augmented UKF and smoothed UKF.   For readability I have not optimized all of the R code (i.e. there are some for loops that could be vectorized).   Here is the common part (defining the Unscented Transform).

## modified number of columns
ncols <- function (x) ifelse(is.matrix(x), ncol(x), length(x))

## modified number of rows
nrows <- function (x) ifelse(is.matrix(x), nrow(x), length(x))

#	Determine transformed distribution across non-linear function f(x)
unscented.transform.aug <- function (
	MUx,				# mean of state
	P, 					# covariance of state
	Nyy,				# noise covariance matrix of f(x)
	f,					# non-linear function f(X, E)
	dt,					# time increment
	alpha = 1e-3,		# scaling of points from mean
	beta = 2,			# distribution parameter
	kappa = 1)
{
	Nyy <- as.matrix(Nyy)

	## constants
	Lx <- nrows(MUx)
	Ly <- nrows(Nyy)
	n <- Lx + Ly

	## create augmented mean and covariance
	MUx.aug <- c (MUx, rep(0, Ly))
	P.aug <- matrix(0, Lx+Ly, Lx+Ly)
	P.aug[1:Lx,1:Lx] <- P
	P.aug[(Lx+1):(Lx+Ly),(Lx+1):(Lx+Ly)] <- Nyy

	## generate sigma points
	lambda <- alpha^2 * (n + kappa) - n
	A <- t (chol (P.aug))
	X <- MUx.aug + sqrt(n + lambda) * cbind (rep(0,n), A, -A)

	## generate weights
	Wc <- c (
		lambda / (n + lambda) + (1 - alpha^2 + beta),
		rep (1 / (2 * (n + lambda)), 2*n))
	Wm <- c (
		lambda / (n + lambda),
		rep (1 / (2 * (n + lambda)), 2*n))

	## propagate through function
	Y <- apply(X, 2, function (v)
		f (dt, v[1:Lx], v[(Lx+1):(Lx+Ly)]))

	if (is.vector(Y))
		Y <- t(as.matrix(Y))

	## now calculate moments
	MUy <- Y %*% Wm

	Pyy <- matrix(0, nrows(Nyy), nrows(Nyy))
	Pxy <- matrix(0, nrows(MUx), nrows(Nyy))

	for (i in 1:ncols(Y))
	{
		dy <- (Y[,i] - MUy)
		dx <- (X[1:Lx,i] - MUx)

		Pyy <- Pyy + Wc[i] * dy %*% t(dy)
		Pxy <- Pxy + Wc[i] * dx %*% t(dy)
	}

	list (mu = MUy, Pyy = Pyy, Pxy = Pxy)
}

The UKF without smoothing:

#	Augmented UKF filtered series
#		- note that f and g are functions of state X and error vector N  f(dt, Xt, E)
#		- Nx and Ny state and observation innovation covariance
#		- Xo is the initial state
#		- dt is the time step
ukf.aug <- function (
	series, 					# series to be filtered
	f, 							# state mapping X[t] = f(X[t-1])
	g, 							# state to measure mapping Y[t] = g(X[t])
	Nx, 						# state innovation error covar
	Ny, 						# measure innovation covar
	Xo = rep(0, nrow(Nx)),
	dt = 1,
	alpha = 1e-3,
	kappa = 1,
	beta = 2)
{
	data <- as.matrix(coredata(series))

	## description of initial distribution of X
	oMUx <- Xo
	oPx <- diag(rep(1e-4, nrows(Xo)))

	Yhat <- NULL
	Xhat <- NULL

	for (i in 1:nrow(data))
	{
		Yt <- t(data[i,])

		## predict
		r <- unscented.transform.aug (oMUx, oPx, Nx, f, dt, alpha=alpha, beta=beta, kappa=kappa)
		pMUx <- r$mu
		pPx <- r$Pyy

		## update
		r <- unscented.transform.aug (pMUx, pPx, Ny, g, dt, alpha=alpha, beta=beta, kappa=kappa)
		MUy <- r$mu
		Pyy <- r$Pyy
		Pxy <- r$Pxy

		K <- Pxy %*% solve(Pyy)
		MUx <- pMUx + K %*% (Yt - MUy)
		Px <- pPx - K %*% Pyy %*% t(K)

		## set for next cycle
		oMUx <- MUx
		oPx <- Px

		## append results
		Yhat <- rbind(Yhat, t(MUy))
		Xhat <- rbind(Xhat, t(MUx))
	}

	list (Yhat = Yhat, Xhat = Xhat)
}

The UKF with smoothing:

ukf.smooth <- function (
	series, 					# series to be filtered
	f, 							# state mapping X[t] = f(X[t-1])
	g, 							# state to measure mapping Y[t] = g(X[t])
	Nx, 						# state innovation error covar
	Ny, 						# measure innovation covar
	Xo = rep(0, nrow(Nx)), 		# initial state vector
	dt = 1, 					# time increment
	alpha = 1e-3,
	kappa = 1,
	beta = 2)
{
	data <- as.matrix(coredata(series))

	Lx <- nrow(as.matrix(Nx))
	Ly <- nrow(as.matrix(Ny))

	## description of initial distribution of X
	oMUx <- Xo
	oPx <- diag(rep(1e-4, nrows(Xo)))

	Ey <- rep(0, nrows(Ny))
	Ex <- rep(0, nrows(Nx))

	Ms <- list()
	Ps <- list()

	## forward filtering
	for (i in 1:nrow(data))
	{
		Yt <- t(data[i,])

		## predict
		r <- unscented.transform.aug (oMUx, oPx, Nx, f, dt, alpha=alpha, beta=beta, kappa=kappa)
		pMUx <- r$mu
		pPx <- r$Pyy

		## update
		r <- unscented.transform.aug (pMUx, pPx, Ny, g, dt, alpha=alpha, beta=beta, kappa=kappa)
		MUy <- r$mu
		Pyy <- r$Pyy
		Pxy <- r$Pxy

		K <- Pxy %*% solve(Pyy)
		MUx <- pMUx + K %*% (Yt - MUy)
		Px <- pPx - K %*% Pyy %*% t(K)

		## set for next cycle
		oMUx <- MUx
		oPx <- Px

		## append results
		Ms[[i]] <- MUx
		Ps[[i]] <- Px
	}

	## backward filtering, recursively determine N(Ms[t-1],Ps[t-1]) from N(Ms[t],Ps[t])
	for (i in rev(1:(nrow(data)-1)))
	{
		## transform
		r <- unscented.transform.aug (Ms[[i]], Ps[[i]], Nx, f, dt, alpha=alpha, beta=beta, kappa=kappa)
		MUx <- r$mu
		Pxx <- r$Pyy
		Pxy <- r$Pxy[1:Lx,]

		K <- Pxy %*% solve(Pxx)
		Ms[[i]] <- Ms[[i]] + K %*% (Ms[[i+1]] - MUx)
		Ps[[i]] <- Ps[[i]] + K %*% (Ps[[i+1]] - Pxx) %*% t(K)
	}

	Yhat <- NULL
	Xhat <- NULL

	for (i in 1:nrow(data))
	{
		MUy <- g(dt, Ms[[i]], Ey)
		MUx <- Ms[[i]]

		## append results
		Yhat <- rbind(Yhat, t(MUy))
		Xhat <- rbind(Xhat, t(MUx))
	}

	list (Y = data, Yhat = Yhat, Xhat = Xhat)
}

Finally, here is the Sine test:

#	Amplitude varying sine function state function:
#		Yt = Amp[t] Sin(theta[t]) + Ey
#		Theta[t] = Theta[t-1] + Omega[t-1] dt + Ex,1
#		Omega[t] = Omega[t-1] + Ex,2
#		Amp[t] = Amp[t-1] + Accel[t-1] dt + Ex,3
#		Accel[t] = Accel[t-1] + Ex,4
sine.f.xx <- function (dt, Xt, Ex)
{
	c (
		Xt[1] + dt*Xt[2] + Ex[1],
		Xt[2] + Ex[2],
		Xt[3] + dt*Xt[4] + Ex[3],
		Xt[4] + Ex[4])
}

#	Amplitude varying sine function observation function:
#		Yt = Amp[t] Sin(theta[t]) + Ey
sine.f.xy <- function (dt, Xt, Ey)
{
	Xt[3] * sin(Xt[1] / pi) + Ey[1]
}

series.r <- sapply (1:500, function(x) (1+x/500) * sin(16 * x/500 * pi)) + rnorm(500, sd=0.25)

## unsmoothed test
u <- ukf.aug (
	series.r, sine.f.xx, sine.f.xy,
	Nx = 1e-3 * diag(c(1/3, 1, 1/10, 1/10)),
	Ny = 1,
	Xo = c(0.10, 0.10, 1, 1e-3), alpha=1e-2)

data <- rbind (
	data.frame(t = 1:500, y=series.r, type='raw', window='Price'),
	data.frame(t = 1:500, y=u$Yhat, type='filtered', window='Price'),
	data.frame(t = 1:500, y=u$Xhat[,3], type='amp', window='Params'))

ggplot() + geom_line(aes(x=t,y=y, colour=type), data) + facet_new(window ~ ., scales="free_y", heights=c(3,1))

## smoothed test
u <- ukf.smooth (
	series.r, sine.f.xx, sine.f.xy,
	Nx = 1e-3 * diag(c(1/3, 1, 1/10, 1/10)),
	Ny = 1,
	Xo = c(0.10, 0.10, 1, 1e-3), alpha=1e-2)

data <- rbind (
	data.frame(t = 1:500, y=series.r, type='raw', window='Price'),
	data.frame(t = 1:500, y=u$Yhat, type='filtered', window='Price'),
	data.frame(t = 1:500, y=u$Xhat[,3], type='amp', window='Params'))

ggplot() + geom_line(aes(x=t,y=y, colour=type), data) + facet_new(window ~ ., scales="free_y", heights=c(3,2))

If you find the above code useful or have improvements to suggest, kindly send me a note. Thanks.

Next Steps
Some next steps:

  1. Determine state-system appropriate for market making & cycling regimes
  2. Determine Multi-Model switching approach
  3. Evaluation & Analysis

I will probably make a digression onto another topic in the next posting before revisiting this.  Stay tuned.

p.s. if you use wordpress, they have a latex equation rendering capability I just discovered (hence better inlining in this post).


Filed under strategies

Unscented Transform & Filters

The unscented transform (and filter) was developed as a practical alternative to MC sampling in estimating the conditional distribution functions of hidden state systems.   We want to determine the hidden state (Xt)  given observations (Yt):

The sequential MC approach to determining this distribution p(Xt | Y1:t) is calculated as follows:

This works for us, since we know the observations y[1:t] and we should have a view on the likelihood of state X given Y by projecting samples of X into Y via function g(.).   There are various approaches to determining the likelihood and reducing the number of particles (samples) required to arrive at a robust integration (such as SIR).

The basic MC particle sampling algorithm is as follows:

  1. generate N samples X[t],i by propagating the X[t-1] samples through f(.) with added noise
  2. determine particle weights as p(Yt | X[t],i) / normalizing constant, where p(Yt | Xt) can be determined by applying the function g(x) and evaluating the likelihood of the computed Yt given Xt
  3. some resampling and kernel-weighted selection may occur, depending on the algorithm
  4. the mean is then the probability-weighted average of the X[t],i (as indicated in the above integral)
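The steps above can be sketched as a minimal bootstrap particle filter (a toy 1-d random-walk system; the resampling here is the simple multinomial variant):

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter(y, f, g, n_particles=1000, sx=0.1, sy=0.5):
    """Bootstrap particle filter: returns the filtered state mean per timestep."""
    particles = rng.normal(0.0, 1.0, n_particles)   # samples from the prior on X
    means = []
    for yt in y:
        # 1. propagate each particle through the state function, with noise
        particles = f(particles) + rng.normal(0.0, sx, n_particles)
        # 2. weight by the likelihood of the observation under g(x)
        w = np.exp(-0.5 * ((yt - g(particles)) / sy) ** 2)
        w /= w.sum()
        # 4. probability-weighted mean of the particles
        means.append(np.sum(w * particles))
        # 3. multinomial resampling by weight
        particles = particles[rng.choice(n_particles, n_particles, p=w)]
    return np.array(means)

# toy system: random-walk state observed directly with noise
true_x = np.cumsum(rng.normal(0.0, 0.1, 100))
y = true_x + rng.normal(0.0, 0.5, 100)
xhat = particle_filter(y, f=lambda x: x, g=lambda x: x)
```

The filtered estimate should track the hidden walk with less error than the raw observations.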

I’ve implemented and used a number of variations of these in the past.   MC approaches are unavoidable for complex distributions (for instance, ones that are multi-modal).   That said, particle filters can be orders of magnitude slower than the Kalman family of filters.

Unscented Transform
Enter the unscented transform.   Much as with the SMC approach, the unscented transform uses sampling to determine the mean and higher-order moments of the distribution p(x[t] | y[1:t]).   Instead of generating numerous random samples, however, the unscented approach takes samples at specific points around the current mean.

The unscented approach also avoids approximating the non-linearity of the system functions, instead solving the simpler problem of approximating the distribution.  2d + 1 sample (or sigma) points are determined around the mean and transformed through the non-linear functions to arrive at a view of the distribution of the state vector.

As illustrated in Julier and Uhlmann’s 1997 seminal paper:

The sigma points determined around the mean (Si and with associated weights Wi), are transformed through the non-linear system function allowing us to determine the moments of the distribution p(μ, P), with mean μ and covariance P (higher order moments can be computed as well):

It then becomes a question of how to select these sigma points Si around the mean Xt.   We know that we want to choose Si = Xt + Ei, where each vector Ei provides an appropriate spread in a given dimension, along the orientation of the distribution.  If these points are well chosen, they will provide enough information to reconstruct the moments of the distribution.

Under the assumption that Xt (pre-transform) is elliptically distributed, it turns out that the column vectors of the Cholesky decomposition of the covariance matrix specify vectors co-linear with the axes of the distribution.   These are determined and scaled as follows:

For more detail read the following paper.
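As a sketch, the sigma-point construction and moment reconstruction can be written as follows (the α, β, κ scaling here is the common scaled-transform convention):

```python
import numpy as np

def unscented_transform(mu, P, f, alpha=1e-3, beta=2.0, kappa=1.0):
    """Propagate N(mu, P) through f via 2n+1 sigma points; return (mean, cov)."""
    n = len(mu)
    lam = alpha**2 * (n + kappa) - n
    A = np.linalg.cholesky((n + lam) * P)    # columns aligned with the distribution's axes
    # sigma points: the mean, plus/minus each scaled Cholesky column
    S = np.vstack([mu, mu + A.T, mu - A.T])
    Wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    Wc = Wm.copy()
    Wm[0] = lam / (n + lam)
    Wc[0] = lam / (n + lam) + (1 - alpha**2 + beta)
    Y = np.array([f(s) for s in S])          # transform each point through f
    mu_y = Wm @ Y                            # weighted mean
    d = Y - mu_y
    P_y = (Wc[:, None] * d).T @ d            # weighted covariance
    return mu_y, P_y

# sanity check: a linear map should reproduce mean and covariance exactly
mu = np.array([1.0, 2.0])
P = np.array([[1.0, 0.3], [0.3, 0.5]])
M = np.array([[2.0, 0.0], [1.0, 1.0]])
mu_y, P_y = unscented_transform(mu, P, lambda x: M @ x)
# mu_y ~ M mu and P_y ~ M P M^T
```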

Why the UKF?
The typical implementation of the EKF uses a linear (and sometimes quadratic) approximation in the update of the distributions.  This can fail spectacularly (or just present significant error) unless your system is well behaved through 1st order dynamics.   The UKF approach also does not require calculation of Jacobian or Hessian matrices, which for some problems may be extremely difficult or impossible to provide.

Next Step
I am particularly interested in the forward-backward UKF with smoothing. I have found that smoothing on the priors (not just the immediate t-1) provides a better forecast for the current and next periods. I will write more about this next.


Filed under strategies

Thinking About State Space Filters

I have not used stochastic state-based systems for a couple of years, but have decided to revisit them. I had previously implemented a number of systems with both the EKF and 3 variants of particle filter, but encountered various issues.

In particular, I found the parameterization of the evolution via the covariance matrices to be opaque and problematic for some systems. Both the EKF and particle filters were also subject to numerical instability if the mean of the state distribution or the observed samples shifted significantly, with likelihoods approaching 0.

I’ve decided to revisit state space filters, but with a different filtering approach, in order to come up with a new take on multi-regime pricing.

I’ll start with an introduction.  The traditional state space model is a discrete (or discretized continuous) system represented by a measurement equation and state evolution equation:

x[t] = f(x[t-1]) + ε_x
y[t] = g(x[t]) + ε_y


  1. X is an n-dimensional vector representing the state of the system at timestep t
  2. Y is an m-dimensional vector representing the measurement at timestep t (for example, a price in our price series)
  3. f(.) is our state evolution function, mapping the (t-1)th state to the t-th state
  4. g(.) is our state-to-measurement mapping
  5. ε_x is our state noise / error ~ N(0, Σx)
  6. ε_y is our observation noise / error ~ N(0, Σy)
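As a concrete illustration, the two equations above can be simulated directly. This is a hypothetical sketch; the particular f, g, and noise levels in the usage below are arbitrary choices, not from the post:

```python
import numpy as np

def simulate(f, g, x0, Sx, Sy, steps, rng):
    """Simulate x[t] = f(x[t-1]) + eps_x, y[t] = g(x[t]) + eps_y.

    Sx and Sy are the state and observation noise covariances; returns the
    (hidden) state path and the noisy measurements."""
    xs, ys = [], []
    x = x0
    for _ in range(steps):
        x = f(x) + rng.multivariate_normal(np.zeros(len(Sx)), Sx)
        y = g(x) + rng.multivariate_normal(np.zeros(len(Sy)), Sy)
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)
```

For example, a random-walk state observed directly with noise (a toy stand-in for a price series) would be `simulate(lambda x: x, lambda x: x, np.zeros(1), 0.01 * np.eye(1), 0.1 * np.eye(1), 50, rng)`.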

Kalman Family of Filters
The Kalman family of filters uses Bayesian logic to implement a linear error-correction model, such that errors in the estimation of Yt propagate back to the evolving state Xt (or, more precisely, an evolving distribution of Xt).

The traditional Kalman filter assumes a linear function f(.) and g(.), usually expressed as matrices:

x[t] = A x[t-1] + ε_x
y[t] = H x[t] + ε_y

We are interested in estimating the state Xt given the noisy observed data Yt, hence are interested in the pdf p(x[t] | y[1:t]). Given that we will have a lagged (t-1) view on this pdf via the recursion inherent in filtering, we first need to be able to compute p(x[t] | y[1:t-1]) and then, using that, p(x[t] | y[1:t]):

p(x[t] | y[1:t-1]) = ∫ p(x[t] | x[t-1]) p(x[t-1] | y[1:t-1]) dx[t-1]
p(x[t] | y[1:t]) ∝ p(y[t] | x[t]) p(x[t] | y[1:t-1])

Hence as pointed out in (Hartikainen 2008) the above linear state equations map to distributions:
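
p(x[t] | x[t-1]) = N(x[t] ; A x[t-1], Σx)
p(y[t] | x[t]) = N(y[t] ; H x[t], Σy)

Without going into the derivation, the resulting predict/update recursion in the linear-Gaussian case can be sketched as follows (a minimal illustration in the standard textbook form, assuming transition matrix A and observation matrix H; not code from the post):

```python
import numpy as np

def kalman_step(m, P, y, A, H, Sx, Sy):
    """One predict/update recursion for the linear-Gaussian model.

    Predict: p(x[t] | y[1:t-1]) = N(A m, A P A' + Sx)
    Update:  condition on the new measurement y to get p(x[t] | y[1:t])."""
    # predict
    m_pred = A @ m
    P_pred = A @ P @ A.T + Sx
    # update: the estimation error in y propagates back into the state
    S = H @ P_pred @ H.T + Sy             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = P_pred - K @ S @ K.T
    return m_new, P_new
```

Note how the update is exactly the linear error-correction mentioned earlier: the gain K scales the measurement error (y - H m_pred) back into the state estimate.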

I won’t go into the math for the Kalman filter, as I would like to discuss its issues and move on to more sophisticated approaches. The 3 main issues I have with the Kalman filter are:

  1. can only express linear state evolution and observation functions
  2. hard to calibrate the degree of fit via the state and measurement noise covariance matrices
  3. can become numerically unstable with unexpected noise (probabilities go to 0 or may oscillate wildly)

The Extended Kalman Filter partially solves one of these issues (that of non-linear state functions) by adjusting the Kalman distribution estimation to use 1 or 2 terms of the Taylor series expansion of the state and observation functions. This works well for functions completely described by their 1st and 2nd derivatives, although it has similar drawbacks in terms of calibration and stability.
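A 1st-order EKF predict step along these lines might look like the following (a sketch using a finite-difference Jacobian for illustration; in practice an analytic Jacobian would normally be supplied, which is precisely the burden the UKF avoids):

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Numerical Jacobian of f at x via forward finite differences."""
    fx = f(x)
    J = np.zeros((len(fx), len(x)))
    for i in range(len(x)):
        dx = np.zeros(len(x))
        dx[i] = eps
        J[:, i] = (f(x + dx) - fx) / eps
    return J

def ekf_predict(f, m, P, Sx):
    """EKF predict: propagate the mean through f, and the covariance
    through the 1st-order (Taylor) linearization F of f at the mean."""
    F = jacobian(f, m)
    return f(m), F @ P @ F.T + Sx
```

The covariance is propagated only through the linearization F, so any curvature of f around the mean is simply dropped, which is where the "fail spectacularly" cases come from.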

Short of using a numerically expensive particle filter, it seems that a variant, the Unscented Kalman Filter (UKF), presents the best choice for the potential state systems I will be using. The UKF, using a deterministic sampling approach mapped through the non-linear functions, provides multiple moments of the underlying distribution. The distribution can then be described with greater accuracy than is afforded by the EKF's Taylor expansion.

Topics to be Discussed:

  1. Unscented Transform
  2. Smoothing
  3. Need for a Multiple Model approach with Markovian switching
  4. Filter approach for multiple models
  5. Possible Models


Filed under strategies