While I'm an old hand at PC building for gaming purposes, I am very new to programming and was wondering if anyone had any recommendations on a relatively easy-to-use language for statistical work that also manages RAM usage in a semi-decent manner.
In short, I have been running R-based calculations as a part of my thesis (population genetics related) and have found the RAM requirements to be simply nuts. At present I am having to run things on a node of the cluster in my department with over 100 gigs of RAM in order to not have the job fail.
To be brief, I have a couple of hundred thousand genetic markers and three populations (n=63, n=192 and n=203 individuals respectively), for which I am running pairwise marker-by-marker statistical comparisons (A-B, A-C and B-C) that are then bootstrapped 1000 times for 95% CIs as a part of quality control. As you have no doubt gathered, R seems to be very inefficient at this (I have faith in how it has been coded; the person who helped me with the script is involved with official R support).
Some of the bioinformatics folks in my lab have mentioned that R is just a pain like that, and that other languages may be better - I was just wondering if anyone had any pointers on where to start.
Try Java, I guess. I'm not a programmer, but it has built-in memory management.
Why not ask your R support guy all these questions? Basically, R is the best free and open source tool on the market. However, there are alternatives, such as SAS, SPSS, and some others. They will cost you an arm and a leg, especially compared to free.
I have asked him, the RAM usage issues are pretty much unavoidable - the software just doesn't 'clean up after itself' very well when looping a function.
The work I am doing requires custom scripts - I have SPSS and it cannot do the kind of work I am doing. So I think I will need to try Perl, Python or something else - I'm just not sure where to start.
Can you be more specific with respect to code?
What kind of statistics are you doing? Many existing libraries in R actually use C underneath, so they are much more efficient than any custom code you write in R.
How do you store and load data? You can try to read one line at a time from a data file instead of loading the entire data file as a data.frame/matrix.
Similarly, instead of storing the results of the pairwise statistics in a data.frame/matrix, you can write them directly to a file.
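The two suggestions above (read a little at a time, write results out as you go) can be sketched roughly as follows. This is only an illustration under assumptions: the file layout (one marker per line, tab-separated, marker ID first), the function name process_markers and the per-marker mean are all placeholders, not the poster's actual script or statistic.

```r
# Minimal sketch: stream a marker file in chunks and write results immediately,
# so only `chunk_size` lines are held in memory at any time.
# ASSUMPTION: each line is "marker_id<TAB>value<TAB>value...".
process_markers <- function(in_path, out_path, chunk_size = 1000L) {
  in_con  <- file(in_path, open = "r")
  out_con <- file(out_path, open = "w")
  on.exit({ close(in_con); close(out_con) }, add = TRUE)

  n_done <- 0L
  repeat {
    lines <- readLines(in_con, n = chunk_size)   # next chunk of lines only
    if (length(lines) == 0L) break               # end of file reached

    fields <- strsplit(lines, "\t", fixed = TRUE)
    ids    <- vapply(fields, function(f) f[1], character(1))
    # Placeholder statistic: mean of the numeric columns of each marker.
    stats  <- vapply(fields, function(f) mean(as.numeric(f[-1])), numeric(1))

    # Append this chunk's results to the output file; nothing accumulates in RAM.
    writeLines(paste(ids, stats, sep = "\t"), out_con)
    n_done <- n_done + length(lines)
  }
  n_done
}

# Tiny demonstration with temporary files.
in_path  <- tempfile(fileext = ".tsv")
out_path <- tempfile(fileext = ".tsv")
writeLines(c("m1\t0\t1\t2", "m2\t2\t2\t2", "m3\t0\t0\t1"), in_path)
n_markers <- process_markers(in_path, out_path, chunk_size = 2L)
print(readLines(out_path))
```

The chunk size is the knob here: larger chunks mean fewer reads but more memory held at once.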
Try to use as few custom-written functions as possible, since function arguments are passed by value rather than by reference: R makes a copy as soon as a function modifies one of its arguments.
Have you tried using the "apply" family of functions instead of loops? Some say that apply functions give better performance, but I suspect that may be just an urban myth. It would not hurt to try, though.
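For what it's worth, here is a small sketch comparing a loop that grows its result, a loop that fills a preallocated vector, and apply(). The data are made up (a random 200 x 10 matrix) and var() is just a stand-in statistic. As far as I know, what tends to hurt in loops is growing an object on every iteration, which forces repeated reallocation, rather than the loop itself; all three versions give the same answer.

```r
set.seed(1)
# ASSUMPTION: hypothetical data, 200 markers (rows) x 10 individuals (columns).
x <- matrix(rnorm(2000), nrow = 200)

# 1) Loop that grows the result vector: every c() call allocates a new vector.
res_grow <- c()
for (i in seq_len(nrow(x))) {
  res_grow <- c(res_grow, var(x[i, ]))
}

# 2) Loop that fills a preallocated vector: one allocation up front.
res_pre <- numeric(nrow(x))
for (i in seq_len(nrow(x))) {
  res_pre[i] <- var(x[i, ])
}

# 3) apply() over rows: same computation, result allocated for you.
res_apply <- apply(x, 1, var)

# All three approaches agree on the values.
stopifnot(isTRUE(all.equal(res_grow, res_pre)),
          isTRUE(all.equal(res_pre, res_apply)))
```

If you want to compare timings on your own data, wrapping each version in system.time() is an easy way to check.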
There is a function called .ls.object (something like that, I can't remember exactly) for tracking objects and their memory usage. You can use it to identify junk objects that are taking up memory.
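If you cannot track that function down, something similar can be put together from ls() and object.size(), both of which ship with R. The helper name list_object_sizes below is made up for this sketch; it is not the function mentioned above.

```r
# Minimal sketch: list the objects in an environment, largest first.
# ASSUMPTION: `list_object_sizes` is a hypothetical helper name.
list_object_sizes <- function(env = globalenv()) {
  nms <- ls(envir = env)
  bytes <- vapply(
    nms,
    function(nm) as.numeric(utils::object.size(get(nm, envir = env))),
    numeric(1)
  )
  out <- data.frame(object = nms, bytes = bytes,
                    row.names = NULL, stringsAsFactors = FALSE)
  out[order(out$bytes, decreasing = TRUE), ]
}

# Demonstration in a scratch environment so the global workspace is untouched.
demo_env <- new.env()
assign("big",   numeric(1e6), envir = demo_env)  # roughly 8 MB of doubles
assign("small", 1:10,         envir = demo_env)
sizes <- list_object_sizes(demo_env)
print(sizes)
```

Calling list_object_sizes() with no arguments inspects the global workspace, which is probably where leftover intermediate objects from a script would show up.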
A radical solution is to save the essential objects to a file after several iterations and then clear all objects from R's memory (usually, in Windows, it is "rm(list=ls(all=TRUE))"). Then load the saved objects again and continue from the last iteration.
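A rough sketch of that save-and-resume idea, written as a variation: instead of wiping the workspace in place, the loop saves its state with saveRDS() every few iterations, and a later call (for example in a fresh R session) picks up from the last saved iteration with readRDS(). The function name run_with_checkpoints, the stop_after argument and the i^2 "statistic" are placeholders for illustration only.

```r
# Minimal sketch: periodically checkpoint loop state to disk and resume from it.
# ASSUMPTION: hypothetical helper; `stop_after` only exists to simulate a session
# that ends before the job is finished.
run_with_checkpoints <- function(n_iter, checkpoint_file,
                                 every = 5L, stop_after = n_iter) {
  if (file.exists(checkpoint_file)) {
    state <- readRDS(checkpoint_file)            # resume from the last checkpoint
  } else {
    state <- list(last_iter = 0L, results = numeric(n_iter))
  }

  i <- state$last_iter
  while (i < min(n_iter, stop_after)) {
    i <- i + 1L
    state$results[i] <- i^2                      # placeholder for the real statistic
    state$last_iter  <- i
    if (i %% every == 0L) {
      saveRDS(state, checkpoint_file)            # keep only the essentials on disk
      gc()                                       # ask R to release unused memory
    }
  }
  state
}

checkpoint_file <- tempfile(fileext = ".rds")

# First run stops after iteration 7; the last checkpoint on disk is iteration 5.
partial <- run_with_checkpoints(10L, checkpoint_file, every = 5L, stop_after = 7L)

# Second run reloads the checkpoint and continues from iteration 6 to 10.
final <- run_with_checkpoints(10L, checkpoint_file, every = 5L)
print(final$results)
```

Restarting R between runs has the same effect as clearing the workspace: everything not written to the checkpoint file is gone, so only the essential objects come back.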