Anyone using R?

Hi all,

I was wondering if anyone is currently working with R in a production setting? The big thing I'm looking for right now is the best approach to cleanse data using R - packages, tutorials, etc. I have plenty of data problems at work, so I have plenty of fodder for learning.

Any pointers would be much appreciated.

Thanks!

I use R on and off and am probably a novice, but I found this introductory book helpful: http://www-bcf.usc.edu/~gareth/ISL/ (it's free on the website). Just curious, what do you mean by a production setting? Are you using this with data retrieved from a database? How large are your data sets?

Here are some good packages to know:
plyr (the accompanying paper was a good read)
dplyr (this package should replace most of plyr; I don't have much experience with it yet, and I haven't found a paper for it like plyr's, but there are tutorials online)
ggplot2
tidyr
reshape2
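Since you asked specifically about cleansing, here's a minimal sketch of the kind of cleanup dplyr and tidyr are good at (the data frame and its columns are made up for illustration): dropping duplicate rows, reshaping wide data into long form, and removing missing measurements.

```r
library(dplyr)
library(tidyr)

# Hypothetical messy data: a duplicated row, missing values, wide format
df <- data.frame(
  id  = c(1, 1, 2, 3),
  jan = c(10, 10, NA, 7),
  feb = c(12, 12, 5, NA)
)

clean <- df %>%
  distinct() %>%                      # drop exact duplicate rows
  gather(month, value, jan, feb) %>%  # wide -> long (tidyr)
  filter(!is.na(value))               # drop missing measurements
```

Real-world cleaning is mostly chaining steps like these, so the pipe (%>%) style pays off quickly.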

I have mostly used the statistical tools from base R for PCA and regression, plus SVMs (via e1071).

Also, R-bloggers is a nice site to subscribe to. I just got this from them, which might give you an idea of what packages other people use: http://varianceexplained.org/r/seven-fav-packages/

@heffjos -

I apologize for the wall of text. Hopefully there are some useful bits in there for you.

Intro/Background
I have never developed in R before, but my undergraduate degree was in software engineering and I'm working on a Master's in BI & Analytics (Stats), so I'm not too worried about learning a new language. The challenges I see right now are figuring out our Dev Stack, learning R/Azure, and data collection/integrity.

I'll probably get some Microsoft hate, but I work at a full Microsoft shop. I mean top-to-bottom: O365, Azure, etc. The only big MS products we don't have are their CRM and ERP. I work for a decent-sized auto manufacturing company, with a whole slew of engineers and manufacturing plants in the US and across the globe.

My Mission
I've got my marching orders: utilize the Azure stack to its fullest. The cool thing is we're just getting into the machine learning, data mining, neural network type stuff. Really, that's a small fraction of what I need to do. I'm researching the tools, processes, and best practices, and setting the Dev Standard.

I've been looking at Microsoft R Open, Microsoft R Server, Microsoft R Client on SQL Server 2016 R Services, R Tools for Visual Studio, R Server for HDInsight, and Azure ML Studio, and I'm trying to get into the beta for Open Mind Studio - not sure if there is R in there or not.

Proof-of-Concept
I'm working on retrieving data recorded from the PLCs (Programmable Logic Controllers) and CMMs (Coordinate Measuring Machines) and both sets come from a MS SQL Server database. A third source, which won't be ready for the Pilot, is from our Material Science Lab (MetLab). This is all for one product line and we will start with one base multi-regression model.

The data isn't too huge yet. We're pulling data for June 2016: it's 91K rows x 900+ columns for the PLC data, and it's serialized. The CMM data and part of the MetLab data are only associated temporally (date/time), so we have some work to do there. The data is fairly clean, but will need some scrubbing.
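Since the CMM records only line up with the PLC stream by timestamp, one way to sketch that association in base R is a nearest-preceding-timestamp join with findInterval() (all table and column names below are made up; it assumes the PLC data is sorted by time):

```r
# Hypothetical PLC stream and CMM measurements, linked only by time
plc <- data.frame(
  plc_time = as.POSIXct(c("2016-06-01 10:00", "2016-06-01 10:05",
                          "2016-06-01 10:10"), tz = "UTC"),
  pressure = c(101.2, 101.5, 100.9)
)
cmm <- data.frame(
  cmm_time  = as.POSIXct(c("2016-06-01 10:03", "2016-06-01 10:11"),
                         tz = "UTC"),
  deviation = c(0.02, -0.01)
)

# For each CMM row, index of the latest PLC record at or before it
idx <- findInterval(cmm$cmm_time, plc$plc_time)
joined <- cbind(cmm, plc[idx, ])
```

At TB scale you'd want something like data.table's rolling joins (roll = "nearest") rather than this, but the idea is the same.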

Post-pilot, the data will get biggish - TBs to start with. As we add more sources, potentially even larger. Honestly, size isn't even the REAL issue - it's stuff that's being recorded on paper AND data integrity. Data is at best in 3rd place in the manufacturing production environment - gotta change that.

We're working with the Azure Platform to ingest, stash, prep, model and visualize it. This is all to get a better understanding of the process, the data and get buy-in for bigger projects.

As you can see, Microsoft put its acquisition of Revolution Analytics to work.

I have MUCH to learn.

I've got some tutorials I came across, but they're also for IPython... gimme a minute, I'll find them for you.

Late post, as I don't check TS often anymore, but when I do I pull up the coding section, and I saw this.

Anyway, I was really big into R about 10 years ago and stopped using it altogether about 5 years ago. I think it's alright if you're not doing anything seriously compute-intensive. The really nice thing about it at the time was that there were a lot of statistical packages available for it that simply weren't available elsewhere. The situation has definitely changed since then, though.

There were some issues I found deeply embedded in the design that really bothered me and led to me abandoning it (e.g. data representation, lack of serious support for serialization, no proper multithreading).

If you're married to R, then I'd say it's worth taking a look at whether HDF5 support (assuming your data is row-wise) has gotten better since I last used it. The speed is a lot nicer than SQL. If you're not married to R, then Python with pandas, scikit-learn, bcolz, etc. all the way.

Hope this was helpful.