# Distribution Estimation

I’ve been travelling for the last 3 weeks, so I have not had much time to post. During the trip, I’ve been thinking further about the following problem:

1. We are interested in determining a representative distribution for some financial factor
2. Suppose we start with a “universal” high-dimensional joint distribution of all random variables that might possibly be relevant to the probability of that financial factor.
3. Suppose the number of variables in the universal joint distribution is high enough that the sampled data cover it only sparsely. We need to reduce the number of degrees of freedom.
4. Is there a smaller marginal distribution (a subset of variables) that provides representative modes and distribution shape?
5. How do we determine it?

Some observations:

1. We should expect clustering around 1 or more modes
2. Variables with little impact will not introduce new modes or substantially alter the shape of the distribution
3. Adding a new low-impact variable (dimension) should just stretch the existing modes uniformly along the new dimension

This brings to mind a brute-force approach that would involve:

1. observing all possible marginal distributions
2. applying a measure to each that determines the degree of information within it, then selecting the one that maximizes information while penalizing the number of variables
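
As a rough illustration of the two steps above, here is a minimal sketch in Python. Since “information” is not yet defined, the sketch swaps in a placeholder of my own choosing: the mutual information between the financial factor `y` and a candidate subset of variables, computed under a joint-Gaussian assumption as $-\tfrac{1}{2}\ln(1 - R^2)$. The function names, the `penalty` value and the synthetic data are all hypothetical, and a Gaussian measure will not see multiple modes, so treat this only as a template for the search itself.

```python
from itertools import combinations

import numpy as np


def gaussian_information(X, y, subset):
    """Mutual information (in nats) between y and X[:, subset],
    ASSUMING joint Gaussianity. This is a placeholder measure."""
    A = np.column_stack([np.ones(len(y)), X[:, list(subset)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    r2 = min(max(r2, 0.0), 1.0 - 1e-12)  # guard the logarithm
    return -0.5 * np.log(1.0 - r2)


def best_marginal(X, y, penalty=0.05):
    """Score every non-empty subset of columns (2**n - 1 of them) and
    return the one with the highest information minus penalty * size."""
    n_vars = X.shape[1]
    best_subset, best_score = (), 0.0  # empty subset carries no information
    for k in range(1, n_vars + 1):
        for subset in combinations(range(n_vars), k):
            score = gaussian_information(X, y, subset) - penalty * k
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score


# Synthetic example: only columns 0 and 2 actually drive the factor.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = 1.5 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=2000)

subset, score = best_marginal(X, y)
print(subset, round(score, 3))
```

The number of subsets doubles with every added variable, so this exhaustive form is only practical for a handful of variables.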

The first step could possibly be shortcut by reducing the set of variables incrementally, but that may not find the global optimum. The notion of “information” also needs to be defined. I’ll post more on this later.
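
One possible reading of “reducing incrementally” is greedy backward elimination: start from all variables and repeatedly drop the one whose removal most improves the penalized score, stopping when no removal helps. The sketch below assumes that reading. It again uses a Gaussian mutual-information placeholder for the still-undefined “information”, and the names `penalized_score`, `backward_eliminate` and the `penalty` value are hypothetical.

```python
import numpy as np


def penalized_score(X, y, subset, penalty):
    """Placeholder score: Gaussian mutual information between y and
    X[:, subset] (an ASSUMED measure) minus penalty * len(subset)."""
    if not subset:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, subset]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1.0 - (y - A @ coef).var() / y.var()
    r2 = min(max(r2, 0.0), 1.0 - 1e-12)  # guard the logarithm
    return -0.5 * np.log(1.0 - r2) - penalty * len(subset)


def backward_eliminate(X, y, penalty=0.05):
    """Greedily drop one variable at a time while the score improves."""
    current = list(range(X.shape[1]))
    current_score = penalized_score(X, y, current, penalty)
    while current:
        # All subsets obtained by removing exactly one variable.
        candidates = [[v for v in current if v != drop] for drop in current]
        scores = [penalized_score(X, y, c, penalty) for c in candidates]
        best = int(np.argmax(scores))
        if scores[best] <= current_score:
            break  # no single removal helps: a local optimum
        current, current_score = candidates[best], scores[best]
    return current, current_score


# Synthetic example: 8 variables, only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
y = 1.5 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=2000)

kept, kept_score = backward_eliminate(X, y)
print(kept, round(kept_score, 3))
```

For n variables this evaluates on the order of n² subsets rather than 2ⁿ, which is where the saving comes from. The price is the one noted above: the search can stop at a local optimum, for example when two variables are informative only in combination.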

Another approach would be to formulate as an expectation maximisation problem, but I have not worked out how this would be done.


### 2 responses to “Distribution Estimation”

1. IceViking

I’m a bit of a noob, so pls forgive. Regarding (3), the field of Compressive Sampling comes to mind, where you subsample (sub-Nyquist, etc.) your sparse matrix and reconstruct the signal via convex optimization, which ensures that any local optimum is a global one. Not sure this relates to your thoughts, but wanted to mention it at the very least. Feel free to email me if you’d like to discuss further.
Cheers
IV

2. grant

Say a machine learning technique was used (one that could easily handle many predictors, maybe stochastic gradient boosting) to generate a distribution. To do so would mean tweaking the inputs to the algorithm to generate enough sample paths to compare it with a distribution generated by Monte Carlo methods. I am trying to figure out a way to test the accuracy of these different ways of generating a distribution. Would it work to just make, say, 100 forecasts (with the actual data available to validate against) and then compare these forecasts to the actual data? That is, look at the top 10% of each forecast distribution (above its 90th percentile) and check whether the actual values did indeed fall there about 10% of the time. Would this work at all? I am very interested in this question and have been thinking about it quite a bit lately.
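
The check described in this comment can be sketched in a few lines of Python. This is only an illustration under assumed conditions: the forecasts are simulated as normal distributions that are calibrated by construction, and the function name `exceedance_rate` and all parameter values are hypothetical. For each forecast it takes the 90th percentile of the sample paths and counts how often the realized value lands above it.

```python
import numpy as np


def exceedance_rate(forecast_samples, actuals, q=0.9):
    """Fraction of actual values above the q-quantile of their own
    forecast distribution. forecast_samples: (n_forecasts, n_paths)."""
    thresholds = np.quantile(forecast_samples, q, axis=1)
    return float(np.mean(actuals > thresholds))


rng = np.random.default_rng(1)
n_forecasts, n_paths = 2000, 500

# ASSUMPTION: each forecast is Normal(mu_i, sigma_i), and the actual
# value is drawn from that same distribution (perfect calibration).
mu = rng.normal(size=n_forecasts)
sigma = rng.uniform(0.5, 2.0, size=n_forecasts)
forecast_samples = rng.normal(mu[:, None], sigma[:, None],
                              size=(n_forecasts, n_paths))
actuals = rng.normal(mu, sigma)

rate = exceedance_rate(forecast_samples, actuals)
print(f"fraction above 90th percentile: {rate:.3f}")
```

A well-calibrated forecaster should give a rate near 0.10. With only 100 forecasts the standard error of that fraction is about 0.03 (the square root of 0.1 × 0.9 / 100), so an observed rate anywhere from roughly 4% to 16% would still be consistent with good calibration. Checking a single quantile also says nothing about the rest of the distribution, so repeating the count at several quantiles is a natural extension.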