2016-11-10

*This post first ran March 2016 via ANNUITAS.com

If you have spoken with a data scientist in the past few years, you’ve probably heard him or her talk about “R.” R is a statistical programming language created in 1993 and is the modernization of the classical programming language “S.” According to the most recent O’Reilly Data Scientist Salary Survey, R is used by more than half of all data scientists and could very soon outpace Excel’s usage in modern data science.

R is incredibly effective because a few lines of code can completely and repeatedly ETL (which is data science jargon for Extract, Transform, and Load) data sets containing hundreds of thousands of rows and columns. Since learning R, I have managed to minimize the ETL necessary for a recurring data project from ten hours of work down to only ten minutes. Given R’s amazing potential to increase productivity by reducing the time needed to complete data projects, all businesses in 2016 manipulating data or statistics should leverage R or a similar statistical programming tool.

Where R Trounces Excel-

I’m sure that many skeptics will read the above and think “Why should I learn R? I’ve been doing my data cleaning and reporting in Excel for years and I’m doing just fine.” Certainly, Excel is an incredibly valuable tool for being able to view complete data sets, make spot changes, and tease out some simple statistics, but those are the only things that Excel really does efficiently.

The moment your data set surpasses 50,000 or so entries, you will be spending more of your time waiting for Excel to load than you will spend actually working. This is where R truly shines: in the two minutes it would take for Excel to even open the spreadsheet, R can both load your spreadsheet and complete whatever ETL you need.

Speed is not the only reason that R is a more powerful tool than Excel. Below are just a few of the data pain points I’ve used R to solve:

De-duplicated data based on multiple fields while still prioritizing one field over the others

Blended spreadsheets to add more data to each entry based on one or more fields

Recombined an enormous database that has been broken up into multiple spreadsheets

Reoriented data and optimized it for use in Tableau

Put all the dates in a dataset in the same format

Fixed labels for each record so they can properly be categorized and reported on.

How Do I Implement R? Does It Cost Anything?

The best part about R is that it’s free. Simply download R from the R-Project website, install it, and you’re ready to go! Now, if you open R, you’ll notice that the interface is rather Spartan. Because of that, I’d also recommend installing RStudio, another free product that makes R a lot easier to work with. RStudio keeps track of your coding history, lets you write code in a separate environment from the console (where you input code), looks up R help files, and keeps track of the packages you have activated.

Now comes the hardest part: learning the language. While you will need to spend a few weeks studying the code, I encourage you to think of this as an investment as opposed to extra work. Late nights or weekends fixing large datasets will completely disappear once you have even a basic fluency in R.

Additional resources are available to help you learn the language more effectively:

Johns Hopkins University’s Data Science courses on Coursera

Learning R by Richard Cotton

The R-Bloggers website

The trick really is to learn how to “think in R,” so to speak. In particular, I’d recommend the courses from Johns Hopkins to achieve this end. The professors are knowledgeable, the lectures are in-depth, the projects are relevant, and best of all it’s free (unless you want to pay for the certification of completion for your LinkedIn page).

A Bright Future Ahead-

The above barely scratches the surface of what R can do. I am still learning new and amazing things to do with the language every day. Luckily, ANNUITAS is as interested in the tool as I am and has been gracious enough to help facilitate my mastering of the language. We have invested in the future of R by becoming a Benefactor of the R Foundation—a non-profit group dedicated to ensuring the continued development of R—alongside other large enterprises like Merck and Shell.

As more people adopt the language for their businesses and research projects, their innovations will be woven into the already intricate tapestry of R, enhancing and coloring the experiences of its entire user-base. There is no better time to integrate R into your workflow than now.

Author: Scott Parent @ScottGParent Optimization Strategy Manager, ANNUITAS

The post Stop Wasting Time in Excel for Data Management- Focus on R appeared first on Annuitas.

Show more