The ideal Quant Environment

R is a wonderful tool in that one can prototype and test ideas very quickly.  If you're not using R and are doing most of your work in C++ or another low-level language, you're missing a lot.  Development in R is perhaps 10-50x faster than in C++, Java, or C#.

R definitely has its warts as a language and environment.  I'm not a huge fan of its matrices, index operators, or lazy expression parsing.  More crippling is the fact that R is slow and has memory issues for large data sets.  I would estimate that R is up to 100x slower than Java or C++, depending on what you are doing.

My current environment is a combination of R and a number of lower-level languages.  For much of my post-exploration work I feel compelled to write in a lower-level language due to R's performance issues.  My preference would be a functional language, as they are generally very concise and elegant.

Ideal Environment
Here is what an ideal environment for me would be:

  1. breadth and depth of R
  2. clean functional language design
  3. concise operations (as close to the math as possible)
  4. excellent rendering facilities
  5. real-time performance
  6. ability to work with large data sets (memory efficient)
  7. concurrency (I do a lot of parallel evaluation)

Candidates
Here are candidate environments that I’ve used or explored:

  1. python
    Cleaner and generally faster than R, but very little in the way of statistics and poor integration with R.  No real concurrency, as the interpreter is locked (the GIL).
  2. OCaml
    Beautifully concise language, but the INRIA implementation does not support concurrency.
  3. F# variant of OCaml
    Solves OCaml's problems but is bound to the Microsoft platform.
  4. Scala
    Excellent performance, a bit bleeding-edge, and much more complex than OCaml, but on the JVM.

The Special Blend
It is impractical to consider reimplementing even a subset of R in another language environment.  A hybrid approach makes sense.  With Python you have RPy, and with Java, JRI.  Neither of these has first-class interaction with the host language, though.
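
For illustration, here is roughly what the string-eval style looks like through JRI today.  This is a minimal sketch, assuming the org.rosuda.JRI API is available on the classpath and R_HOME is configured; note that every interaction is an R expression assembled as a string, with a generic result unpacked by hand:

import org.rosuda.JRI.Rengine

object JriStringEval {
  def main(args: Array[String]): Unit = {
    // start an embedded R interpreter (requires the JRI native library)
    val engine = new Rengine(Array("--no-save"), false, null)

    // the call is assembled as a string; nothing is type-checked
    val rexp = engine.eval("cumsum(rnorm(10))")

    // the result comes back as a generic REXP to unpack manually
    println(rexp.asDoubleArray().mkString(", "))

    engine.end()
  }
}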

I would like the power of a production functional language, but with the same development ergonomics and interactivity that R has.  It occurred to me that I could do the following (a rough sketch follows the list):

  1. dump function templates of all R functions in one’s environment into a Scala module
  2. create implicit conversions from R fundamental types into Scala objects (matrices, vectors, data frames, etc)
  3. create specialized mappings between my Scala-side timeseries and R zoo / ts objects
  4. create some wrappings for unusual usage patterns such as the ggplot operators
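
Here is a sketch of steps 1 and 2.  All type and bridge names below (RVector, RBridge, RConversions) are hypothetical placeholders, not a real API; under the covers the bridge would eval R via something like JRI and marshal the raw data:

import scala.language.implicitConversions

// illustrative R-side value type
case class RVector(values: Array[Double])

object RBridge {
  // stand-in for the real bridge: would eval R and marshal data back
  def call(fname: String, args: Any*): RVector =
    RVector(Array.fill(10)(0.0))
}

// step 1: a module of generated proxies, one stub per R function in scope
object R {
  def rnorm(n: Int): RVector = RBridge.call("rnorm", n)
  def cumsum(x: RVector): RVector = RBridge.call("cumsum", x)
}

// step 2: implicit conversions between R fundamental types and Scala structures
object RConversions {
  implicit def rvecToArray(v: RVector): Array[Double] = v.values
  implicit def arrayToRvec(a: Array[Double]): RVector = RVector(a)
}

// usage: reads almost like R, but is statically typed Scala
object Example extends App {
  import RConversions._
  val path: Array[Double] = R.cumsum(R.rnorm(10)) // converted out implicitly
  println(path.mkString(", "))
}

Timeseries (step 3) and plotting wrappers (step 4) would layer on the same machinery: a mapping from a Scala-side timeseries class to zoo / ts on the R side, plus hand-written shims where R's idioms do not translate mechanically.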

Scala has an interactive mode where functions, classes, etc. can be created on the fly.  Because we've dumped a proxy function for each R function, we also have near first-class access to R functions.  There would be some differences, in that we would not be able to replicate the lazy-evaluation aspect of some R functions; functions that use expression() would have to be wrapped specially (see the sketch below).
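
For instance, a hypothetical wrapper for a quoting function might accept the expression as a string and let R do the quoting on its side (just a sketch; R's curve() is one such function):

object RLazy {
  // stand-in for a real string-eval bridge (e.g. JRI's eval)
  private def evalR(code: String): Unit = println(s"R> $code")

  // R's curve() quotes its first argument; Scala cannot defer evaluation
  // the way R does, so the expression travels as a string instead
  def curve(expr: String, from: Double, to: Double): Unit =
    evalR(s"curve($expr, from = $from, to = $to)")
}

// usage: RLazy.curve("sin(x) * exp(-x / 5)", 0, 20)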

Because Scala allows a lot of syntactic magic, the environment would look very much like R and have complete access to R, but with the huge upside of the more powerful Scala.  One can write code that is “production ready” from the get-go and/or do very compute-intensive operations otherwise prohibitive in R.

Now I need to find the time to put this together …

Addendum
I have not decided on which language / environment to base this on.   Scala’s main sell is that it is on the JVM.   The other functional language contender on the JVM with above-scripting-level performance is Clojure.

Clojure is basically a dialect of Lisp, and there is already a project called Incanter that provides a statistical environment within Clojure.  It looks interesting, if early.  Clojure does not yet have a performance profile that is close enough to the metal.  I would expect to see improvements over time, but given Lisp's lack of static typing and type inference, I am doubtful that we will see Clojure reach the level of statically typed languages.

Since writing this post and having some conversations, I've begun to think that F# may be the best choice.  My language preference has been OCaml, but the INRIA OCaml implementation is handicapped.  F# is closely related to OCaml and therefore may be a fit.

F# on the Microsoft .NET platform has been shown to be as performant as C#.  From benchmarking C# a couple of years ago, it was clear that the CLR is pretty close to the JVM in terms of performance.  Given my cross-platform constraints, the question has been: how viable is F# with Mono from a performance point of view?

It seems that Mono's performance is being addressed.  The Mono-LLVM experiment improved Mono benchmarks significantly.  I have not been able to test Mono with this extension; I will have to experiment.

To wed this to R would require writing the equivalent of JRI for .NET / R.


36 responses to “The ideal Quant Environment”

  1. Curious to see how this works out. Bit skeptical of the conceptual impedance mismatch, but worth a try to see if Scala sugar can glaze it over cleanly.

    • tr8dr

      Well, writing code in the Scala/R environment would certainly be different in that you could be writing Scala functions, classes, etc.

      i.e.:

      def f (x) =

      versus:

      f = function (x)

      Matrices, vectors, and timeseries would be manipulated locally in Scala, avoiding the roundtrip, but also allowing for more concise syntax.

      Since Scala allows for implicit type conversion, we would avoid the ham-fisted approach required with the JRI and RPy approaches. One can also see R functions as first-class, as opposed to evaluating R expression strings.

      Under the covers, of course, there is R code being “evaled” and data being transferred in raw form, converted into matrices, vectors, etc. as appropriate; the fragment below illustrates.
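
      Concretely, the conversion step might look something like the following illustrative fragment (it assumes a JRI-style REXP result; the point is that the raw transfer is unpacked once, after which manipulation stays on the Scala side):

      import org.rosuda.JRI.REXP

      object Convert {
        // unpack a raw R result into a plain Scala matrix a single time;
        // no further round-trips to R are needed afterwards
        def toMatrix(rexp: REXP): Array[Array[Double]] =
          rexp.asDoubleMatrix()
      }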

      I’m not sure when I’ll have time to do this, but hope to do something by Spring.

  2. I will also be interested to see what you find.

    If you’re looking at OCaml, you might want to have a look at what Jane Street Capital is doing with it (e.g. see this video: http://ocaml.janestreet.com/?q=node/61).

    Beyond that, I would just mention that it is my impression that Clojure is just as fast as Scala (although there are very few available benchmarks), and its Java integration is far superior (not to mention the many other functional benefits of the language). Also, several people (including me) are working on R integration with that language.

    Lastly, any reason not to investigate Haskell as well?

    • tr8dr

      Shane, I saw your excellent presentation on the Clojure binding. I was surprised to see your performance comparison for the primes function, where R did quite a bit better.

      In my tests, Scala matches Java performance (which is already 90% of C++ in many cases). I had read quite a bit about Clojure and had gotten the impression that there is a performance gap between it and Java for numerical work. I'm quite comfortable with Lisp, however.

      I guess I should spend a bit of time with Clojure and do some benchmarking.

      Haskell is an elegant language. I like that everything is lazy, though this translates into a lot of work for the compiler to optimise away. I would prefer to be connected to either the JVM or the CLR so I can access their large base of libraries. The large momentum around both of these VMs tends to make them the most performant and most robust.

      There is Jaskell for the JVM, but it is not suitable if you need to get close to the metal.

      Scala is not the best language design out there (it is complex if you dig into it), but at least it is close to the metal and is on a VM. F# might be compelling if the Mono-LLVM mapping matures and is competitive with Java.

    • tr8dr

      One more note. I had read one of the papers by the Jane Street guys a few weeks ago. I was impressed with their approach and thrilled to see a prop house taking a bold step with FP.

      However, they mentioned that INRIA is not interested in solving the concurrency problem and that the project is run like a “cathedral” rather than a bazaar, so it is really up to the whims and priorities of the folks at INRIA.

      That kind of turned me off. Now Jon Harrop is developing a VM called HLVM for OCaml. The VM is built for parallelism and high performance from the get-go. My concern is that it is a one-man show. I have a huge codebase and want to be conservative about where I invest time in remapping and/or rewriting large parts of it.

  3. Interesting. I only use R for analysis (and I hasten to add, much less handily than you!) but then code what I want in Java or C. If I wanted to access R from a strategy (which I don't, as it's too slow and I like to minimize 'heavy' interfaces), I think I'd do it directly from C.

    I agree that a functional language is handy, but my look into scala left me with the impression that it’s a bloated language built on a bloated environment. Not a candidate in my book, but I’ll be curious to hear how you progress..

    btw, could you post or link the jane st paper you mention? thanks!

    • tr8dr

      I’ll have to admit the following:

      – I’ve never considered calling out to R from production code before
      – I agree with you on Scala. I need to find something practical though.
      – I’ve not yet decided on Scala, but that I can access a huge Java codebase is the main selling point

      There is a good chance I'll use OCaml, as I have a strong preference towards it as a language, but I have to find a way to make it practical.

      Mono's recent experimental mapping to LLVM has improved performance dramatically, but it is still an early implementation. It may allow for very efficient evaluation of F#. I can also easily map my code base onto the CLR.

      I am thinking more that, even during proof-of-concept work, I may need to calibrate a state system, do an occasional MC evaluation, or just do something on a large timeseries. Doing that in R as it currently stands is insanely slow. So, since I spend most of my research time exploring and proving ideas, a hybrid environment makes sense for me.

      In some email exchanges with Shane, I began to think about whether it could also make sense to use R sparingly in a production env, calling out from a low-level language. At this point I have a large model base in Java / C++ / fortran and find myself rewriting something that has already been done in R from time to time.

      Here is the link to the paper I read:

      http://www.janestreetcapital.com/minsky_weeks-jfp_18.pdf

      • quantivity

        In all seriousness, I am curious what you believe are the key use cases driving adoption of functional (beyond that many of us are simply bored with modern VM-based OOP, which is certainly driving much of the Scala momentum)? Undoubtedly many trading abstractions can indeed nicely be modeled via event-driven, actor-like patterns; but the business risk of heavily investing in a potentially dead-end language seems excessively high given the potential comparative benefits (thinking of languages du jour over the past 15 years, and the fact none of your code is written in any of them).

  4. tr8dr

    @quantivity Well, I know what you mean by dead-end language, but for better or worse, we tend to have more of the “undead” ;). Consider that FORTRAN is still alive and well. I’d certainly like to see it disappear.

    As for functional languages, these had largely languished in academic settings for the last few decades. I would argue the main reason for this is that, for many years, the authors were more interested in programming language research than in making the languages practical.

    In recent years a handful of languages from the FP space have made it to mainstream with excellent performance, good tools, and polish. We are now starting to see quite a bit of interest in the broader community. I am therefore doubtful that languages such as F# are going to disappear. Of course we have R as an example of a very successful functional language as well.

    Now why functional languages? For me it is about my ability to be more productive. FP tends to be *much* more concise and require much less boilerplate. Functional languages are just *that* much more productive, making the paradigm shift worth doing.

    The gap between my productivity in R and the cost of development in Java, C#, or C++ is huge. R is not even an especially powerful functional language as it stands.

    The reason why I’ve waited so long to use FP for “real work” is for reasons you mention. Namely, I need to see that there is a production quality implementation and enough momentum behind it to carry it forward.

    As for Scala, I don't think it is about boredom, or rather I don't think that is the main reason. I think many are approaching Scala not because it has FP, but because it is generally more concise and has important features that Java lacks.

    That said, my experience thus far with Scala is that it is too complex and lacking the simplicity of other FP languages (it seems to have come from the C++ school of language design). I am actively exploring whether F# in conjunction with Mono is workable …

  5. You may be interested to know Guillaume Yziquel has recently released a revised version of OCaml-R, which is a binding embedding the R interpreter into Objective Caml code.

    Also, Dirk Eddelbuettel created the RInside package, which provides C++ classes that make it easier to embed R in C++ code.

    Lastly, I’ve been working with a group of guys in Chicago to build trading infrastructure in R. One goal is to handle large amounts of data quickly (e.g. pulling subsets of 100+ million rows from 10GB+ data sets). I’d be interested to hear what your needs are. If there’s not a faster way to do it in R, there should be. 😉

    • tr8dr

      Josh, thanks for the note. The Ocaml-R package sounds very interesting. I will definitely take a look at it.

      I’ll answer your last question in two ways.

      1. What would R need to be a complete environment for me
      2. What particular needs do I have

      Complete Environment
      – compiled functions: the performance gap between, say, Java and R is some orders of magnitude
      – better memory utilization for timeseries and matrices
      – clean up language and data structure warts
      – would love to see a gradual move to a more robust VM and language or R alternative develop

      Needs I have:
      – I tend to do a lot of machine learning and iterative algorithms, so performance is key
      – I use large data sets

      Currently I use rJava for work that needs performance, but ultimately the hybrid is too disruptive. What R has going for it is a vast community. S+ has a lot of baggage, some good and some bad. Some language features will make R especially hard to compile. My view is that a new environment is needed. Perhaps the best that can be done is a hybrid environment given the huge base of packages …

      • Not to sway your decision, but to help me understand:
        – Are you referring to compiled functions in base R, user packages, or both?
        – What would better memory utilization look like? Floats instead of doubles? What time series class are you using?

        My experience with rJava performance was so-so (in the LSPM package). We got quite a performance boost from moving the Java to C, which I attribute to the R/C API being better than the R/Java API.

      • Having rewritten OCaml-R, here’s a few comments.

        Unfortunately, development of OCaml-R is currently stalled, and here is the current status.

        The development of OCaml-R focused on two main ideas: wrapping R functions in OCaml as first-class values, and attempting to get a strong typing of R code that you may want to run in OCaml. It also aimed to make things as static as possible (“compile-time” initialisation of the R interpreter, for instance) so that OCaml bindings of R packages are not too unnatural. The binding works and is fairly robust (I believe), but suffers from some shortcomings.

        Enhancing the type system ran into issues that are now solved in OCaml 3.12, but I do not have enough time to follow up on that. The garbage collector interface between R's GC and OCaml's GC currently frees n values in quadratic time, which is bad (solving this requires an enhanced R API, or a garbage collector implementation in the binding itself). Multithreading R code is difficult in R itself (parallelising is another issue), and an enhanced R API with a callback inside R's evaluation mechanism could allow for Lwt-style multithreading of mixed OCaml/R code.

        Another tough issue is R’s calling conventions. If you want to reduce the overhead of calling R code from foreign languages, you have to present R function calls written in OCaml in a syntactic form suitable for R’s calling conventions. Passing an OCaml list of arguments, names, etc… and converting this to R is clumsy overhead. I therefore believe that you would need a Camlp4 syntax extension just for this.

        Ideally, you’d also need to do some R introspection and type inference (as far as that is possible) to be able to write OCaml code such as

        module X = R module quantmod

        This is probably what I’ll try to achieve next, as that would make the binding easier to use (hopefully with less necessary subtyping). Typing hints would probably be necessary, but it would still be better than nothing.

        As to compiling R, I believe that it’s a tough issue. Nevertheless, it might be worth a try, following guidelines such as

        http://www.venge.net/graydon/talks/mkc/html/index.html

        It seems feasible, but obviously quite a lot of time would be needed to deal with all language quirks in R.

        But as far as I am concerned, this is too remote for the current status of OCaml-R. Enhancing the expressivity and security of the OCaml-R type system, dealing with the quadratic time garbage collection interface, and having a decent Camlp4 syntax extension would already be great stuff (as far as my small available free time goes).

        Contributions are welcome.

  6. tr8dr

    I meant having a JIT compiler for R, so that functions I write are fast. Of course parallelization (via snow) can help, but it is a shame that much of the cost is in evaluating R itself as opposed to real computation.

    As for memory utilization, I am not trying to economize with floats vs. doubles; I can use all of the precision. I am merely noting that matrices and other data structures use multiples of the memory that “equivalent” structures in Java or C++ would.

    As for rJava, it is not a terribly efficient interface. On each call through rJava a lot of R code is evaluated, and the Java side of it is also not all that sophisticated. So yes, if called at high frequency it has a good amount of overhead.

    As for timeseries, because I am more often than not working with irregular timeseries, I use zoo. I ran across a package that unifies these but have not played with it. Any recommendations?

    • Though it’s probably not the extent of the JIT you’re looking for, Ra is a step in that direction.

      I still don’t understand the memory utilization issue, but I don’t know Java or C++ very well. This toy example is close to what I’d expect:
      > x = matrix(rnorm(1e6)) # 1mm doubles
      > object.size(x) # 8mm bytes
      8000112 bytes

      I’d recommend using xts. It extends zoo, but it’s optimized specifically for time series. It’s significantly faster in some areas (though many of those speed improvements are being back-ported to zoo).

  7. Johan

    Looks like Microsoft is pushing for better cross platform support for F#: http://blogs.msdn.com/dsyme/archive/2010/03/10/contract-position-in-the-f-team-compiler-and-visual-tools-software-engineer-for-cross-platform-f.aspx

    Combine this with the statement from Miguel de Icaza’s blog (http://tirania.org/blog/archive/2010/Feb-17.html):

    “We are working to improve our support for F# and together with various groups at Microsoft we are working to improve Mono’s compatibility with the CLR to run IronPython, IronRuby and F# flawlessly in Mono. Supporting F# will require some upgrades to the way that Mono works to effectively support tail call optimizations.”

    Sounds promising for the future.

    I’m also in the stage where I’m contemplating what to write my ATS in, and so far I’ve ignored F# more or less due to the lack of good *nix support. I’ve looked into Clojure from the concurrency point of view, but I think I’m gonna hold out for a little bit longer until I commit myself to anything. Optimizations done to the core and more work on Incanter might make it worthwhile for me later, but then F# on Mono might perhaps work better anyway. And I still think it’s easier to read F# code than Clojure code. 🙂

    So hopefully I’ll survive with R sprinkled with some C until then. 🙂

    • tr8dr

      I have to agree with you regarding Clojure (and yes, there are current performance issues). I'm going to move to F# (together with my libraries) as soon as I feel the environment is workable.

      I saw that posting on Don Syme’s blog. Sent him a note to clarify what was meant by cross-platform. I would suppose that he can’t reveal. Hopefully good news.

  8. I am doing data mining and machine learning, and a big fan of F#. Most students and researchers in this community are using Matlab, C/C++ and Java. Some use R and Python with Numpy, Scipy, but the number is not big.

    For getting things done quick and dirty, Matlab is by far the best environment I have seen. C/C++ is for performance. Java is good engineering; code in Java is more reusable.

    I wrote good C code and used Matlab for research in the past. Now I am moving to F# as a personal hobby. It is just fun to write F#!

    I am starting a blog on F# and data mining: http://fdatamining.blogspot.com/ I hope later to finish a set of wrappers (just as Incanter did) for existing state-of-the-art data mining and machine learning libraries, and to contribute the framework to the community.

    • tr8dr

      Nice. Looking forward to your F# work / posts. How do you find F# performance-wise? I read one of your posts regarding matrix operations, and it seemed that the .NET CLR is not clever enough to avoid array bounds checks. The Java JVM can usually avoid array bounds checks and gets performance in the C ballpark.

      I was also surprised to see that the Numpy operations were as fast as reported. My impression of Python is that it is, or was, 10-50x slower than Java / C++.

      As for Matlab, it has a well-developed library set useful for ML and signal processing applications. That said, I find the language poor relative to R and of course F#.

      I'll most likely be gradually moving over to F#.

      • In my experience, F#’s performance is close to C#’s, sometimes a little better if we write imperative code in F#.

      • However, F#'s memory usage is still kind of unpredictable; at least it usually costs more than I expect. In a functional language like F#, short-lived objects are created more frequently, but the .NET GC is currently not optimized for F#.

        Java costs about 24 seconds on average to perform A*A', where A is a 280*10304 random matrix. .NET managed code costs 13. Native C costs 8. My self-made blas.dll costs 4. Matlab and Numpy cost 0.4…

  9. Johan

    Btw, have you looked at all at Lua, or more specifically at LuaJIT?

    I just found out about it and judging by the tables at http://shootout.alioth.debian.org/u32/benchmark.php?test=all&lang=all&d=data&calc=calculate&gpp=on&java=on&luajit=on&v8=on&lua=on&tracemonkey=on&box=1 the performance appears to be very good.

    Haven't looked around much yet, but some interesting projects leveraging Lua seem to be http://torch.sourceforge.net/ and http://www.nongnu.org/gsl-shell/

    • tr8dr

      Lua looks interesting in that it is quite performant. It does remind me of JavaScript or C in its structure. From what I can tell, data structure capability in Lua is quite basic (i.e. no classes). I noted in the Debian shootout that Mono F# is not all that far behind Lua as well …

  10. I'd recommend Haskell. I've just rewritten a monster Genetic Programming library from C to Haskell, and the codebase is a tenth of the original size. Performance has suffered, as expected, but I'm looking to use the Data.Array.Accelerate library to automatically parallelise the code over GPUs, which will be a small code change compared with a major rewrite of the C code.

    • tr8dr

      Thanks, Haskell is a world of difference from C and a very rigorous functional language implementation.

      I’m leaning towards F# because I think it is the most practical functional language available. Haskell does have performance issues due to its academic / uncompromisingly pure-functional nature. I do a lot of matrix math and/or high dimensional array manipulation. I also like the mixed model where I can have OO mixed in with functional. For very large code bases OO provides a nice structure.

  11. Scott Locklin

    My needs and your needs don’t overlap significantly, but I find Lush to be extremely useful. Easy to pull in libraries, reasonably easy to compile stuff, and it’s got the ability to do some data interaction. Not as pleasant or high level to work with as R, but it’s a useful set of compromises.

    If I had to do it over again, I'd probably have done it in Python + SWIG or OCaml.

    Not much in the way of concurrency though, unless you’re into stuff like MPI.

  12. tr8dr

    Lush looks very nice and allows direct inlining of C code. I used to do a lot of Scheme back in the day, but gave it up many years ago.

    You are the second or third person who has recommended it. Of course another big recommendation is that it is Yann LeCun's playground, so it should be full of all sorts of useful stuff.

    I am now seriously looking to move to a combination of F# / C# in mono. The mono project has moved forward enough in terms of performance to make it viable for me.

    As for MPI, I used to use PVM (MPI's predecessor) and Linda for distributed / concurrent apps. These days I get enough mileage from either threading on available cores or other distribution techniques.

    • Scott Locklin

      I’d be very surprised if I was the second or the third to mention it, unless you’re referring to a conversation we may have had on the Reactor. Yann and Leon have some very neat ML doodads in Lush, but the real strength of it is that it contains everything you need to be productive.
      Downsides: the new version doesn't have MPI thus far (I'm an old-school PVM/Cray hacker myself), and it is slow in coming.

  13. Tanmay

    Hi, I am revisiting this post, and I think that Python would be the language that meets most of your requirements. Have you tried Python lately with the power of pandas, scikit-learn, and scipy?

    • tr8dr

      Yes; particularly with the advent of IPython notebooks, I became completely sold on Python as a research environment. I rarely jump into R now. In addition to the notebooks, what I like about Python is:

      – data structures / classes & ability to write fairly large libraries because of that
      – performance (much better than R)
      – seems to have a bigger ML community

      What I think is lacking:

      – visualization is not quite there yet
      – matplotlib looks like it is from the 90s
      – bokeh is still early (though very interesting)
      – ggplot for python is very early

      • Tanmay

        Hi, thanks for replying. I am sure you already know, but just in case you didn't: you could use the seaborn package on top of matplotlib to enhance it a bit.

  14. tr8dr

    @Tanmay
    Yes, I've looked at and have used seaborn. It looks nice and moves in the direction of modern rendering. However, it can be a bit quirky; for example, while “jointplot” looks great and is very useful, it cannot be embedded as a subplot (i.e. on a grid, etc.) as it “owns” the figure. I understand why it was implemented that way, however.

    I am getting more interested in bokeh because of the interactivity it supports. For example, I want to be able to mouse-over parts of a time series and see corresponding order book or other data. It still has a long way to go, but has a decent level of functionality today.
