2016-12-09

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

At the University of Utah, I teach the R lab that accompanies MATH 3070, “Applied Statistics I.” None of my students are presumed to have any programming experience, and they never hesitate to remind me of that fact, especially when they are starting out. When I create assignments and pick problems, I often can write a one- or three-line solution in thirty seconds that students will sometimes spend four hours trying to solve. They then see my solution and slap their foreheads at its simplicity. I can be tricky with my solutions. For example, suppose you wish to find the sample proportion for a certain property. A common approach (or at least the one used in the textbook our course uses, Using R for Introductory Statistics by John Verzani) looks like this:
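A minimal sketch of this kind of approach, assuming a hypothetical character vector x of countries of origin (both the data and the exact steps are illustrative assumptions, not quoted from the textbook), counts the matching observations and divides by the sample size:

```r
# Hypothetical data: country of origin for eight cars (an assumption for illustration)
x <- c("USA", "Japan", "USA", "Germany", "USA", "Japan", "USA", "Germany")

# Keep only the observations equal to "USA" and count them
n_usa <- length(x[x == "USA"])

# Divide the count by the sample size to get the sample proportion
prop_usa <- n_usa / length(x)
print(prop_usa)
```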

But if you realize that x == "USA" produces a vector of boolean values that, when coerced to numeric, become 0’s and 1’s for FALSE and TRUE respectively, and that the sample proportion is the sample mean of Bernoulli (0/1) random variables, you can get a much simpler and easier-to-read solution:
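A minimal sketch of the trick, assuming a hypothetical character vector x of countries of origin (the data are made up for illustration), might look like this:

```r
# Hypothetical data: country of origin for eight cars (an assumption for illustration)
x <- c("USA", "Japan", "USA", "Germany", "USA", "Japan", "USA", "Germany")

# The comparison yields a logical vector: TRUE where the entry is "USA"
is_usa <- x == "USA"

# Coercing to numeric turns TRUE into 1 and FALSE into 0
print(as.numeric(is_usa))

# mean() performs that coercion itself, so the mean of the logical vector
# is the sample proportion
prop_usa <- mean(is_usa)
print(prop_usa)
```

Because mean() coerces logical values on its own, the explicit as.numeric() call is only there to make the intermediate step visible.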

This is one of those tricks that students do understand at some level, yet it still blows their minds when they see it, and there are many other things they see that make them somewhat envious. So I do have a lot of students ask me, “How did you get so good at R?”

As some background, I fiddled around with programming as early as when I was ten years old; my dad was a computer programmer and he introduced me to QBASIC to get me started. I also fiddled with Visual Basic, and while I never really could program well, I did understand some basic concepts and I could make simple text programs with QBASIC. In high school, I found my dad’s old C textbook and I went through it, chapter by chapter, working half the problems in each on my Linux laptop, thus giving me an even more solid understanding of programming. That said, I was not one of those teenagers hacking into the high school’s network and releasing a homebrew virus that played a fart sound every time someone clicked the mouse; I never did any projects and mostly just let any skills I developed languish. I did not have any structure or purpose, and truth be told, I did not want to be a programmer.

The class I teach is also the first class where I learned R. (It’s even in the exact same room, and I like to harass my students by saying “I’ve been in the exact same seats as you; I sat in that general area over there.”) I did well; I already understood many of the ideas basic to all programming. After the class ended, I tried fiddling around with some real data sets, looking at the correlation between homicides and guns per capita (then stopped when I realized I did not know what I was doing). Following that I worked as an intern in Washington, DC, for a semester, where I did absolutely nothing with R, and I would not use it again until I took MATH 3080 (the second part of MATH 3070) and the R lab accompanying that class. Those two classes were sufficient preparation for using R in future class projects and a real-world project (that got me media attention). (You can find my lousy first R scripts for that project here.)

December 8th is the last day of the R Lab for MATH 3070 in the Fall 2016 semester. Some students will continue on to take MATH 3080; others will go elsewhere, and many will need to use R sooner rather than later. I’m aware that they are not yet R pros, so I’ve written this blog post to give advice for not only learning R and programming from here on out, but also contributing to the R community. This is largely based on my own experience, and I invite other R users to share their own advice and stories in the comments.

Classes

One way to learn R well is to take more classes. For University of Utah students, you may have already taken MATH 3070 (perhaps from me), and you may take the MATH 3080 R lab to learn more about using R for statistics (naturally, you will be learning to use R to solve problems discussed in that course, such as ANOVA, linear regression, goodness-of-fit analysis, nonparametric tests, and so on).

To my knowledge, that’s the extent of undergraduate courses for learning R. Furthermore, the only course I am aware of that focuses on programming in particular, and covers R programming, is STAT 6003, entitled “Survey of Statistical Computer Packages,” which covers R in addition to SAS, SPSS, and STATA. Otherwise, you would be learning R by taking MSTAT courses such as MATH 6010 (linear models), MATH 6020 (multivariate models), or MATH 5075 (time series). These courses don’t teach R so much as use it for applications of the mathematical topics of interest (though in the MATH 5075 and 6020 textbooks there are R examples). Additionally, I am not aware of classes in the computer science department that teach R beyond what you would learn in the R labs from the Mathematics department; CS 6190, “Probabilistic Models”, uses R for Bayesian analysis, but that class is not at all for undergraduates.

Thus, for University of Utah students, the MATH 3070 labs may be the last courses where you learn R programming; you are expected to learn R from here on out on a need-to-know basis. That said, you could look for free online courses to sharpen your R skills. Perhaps look at Coursera and the R programming course provided by Johns Hopkins University if you want a structured class. You could also look at DataCamp for a less structured approach; you can find video lectures with problems where you can learn more about some R topics.

If you’re willing to shell out money, perhaps look into R workshops. Some big names in R, such as Hadley Wickham, offer R workshops on various topics throughout the year. That said, be prepared to spend a considerable amount; there’s a good chance that you would have to travel a distance to even get to the workshop.

Books

I like the textbook the Mathematics department uses for the R lab, John Verzani’s Using R for Introductory Statistics. I wrote my lecture notes using his book, and even going through it for a second time I discovered functions and techniques that I was not familiar with and improved my own abilities. For students continuing to MATH 3080, keep the book; we will be using the same one. Otherwise, if you really want the money for the book back, you could use Verzani’s original R notes, referred to as simpleR, upon which his book was based, but the second edition of the book represents a significant improvement over even the first. If you don’t need the money, maybe consider keeping the book.

While Prof. Verzani’s book is good for R in the context of introductory statistics, it does not say much about R’s inner workings. It still treats R from the perspective of a humble user, not a programmer or power user. If you want to learn R as a programming language, I would highly recommend Hadley Wickham’s book Advanced R, available for free online or in print. Hadley Wickham is seen as a major authority in the R community; he’s a prolific package author, wrote many of the best R packages in existence (in my opinion), and clearly has a deep understanding of the R language. Even reading two chapters of Prof. Wickham’s book improved my R skills tremendously.

There are many, many publishers and authors writing about R, and you can find books from O’Reilly or even R for Dummies (though I would hope my students would be beyond the for Dummies level). That said, one publisher I would like to highlight is Springer. Their Use R! series includes many great books on topics specific to R, such as certain R packages or common R tasks (such as Hadley Wickham’s book, ggplot2, which I highly recommend to anyone and which can be obtained for free from the University of Utah library). In addition, many Springer books on more general topics include R examples, allowing readers to learn both the theory and the R application (some examples include the MATH 5075 textbook, Time Series Analysis and Its Applications with R Examples, by Shumway and Stoffer; and the MATH 6020 textbook, An Introduction to Applied Multivariate Analysis with R by Everitt and Hothorn).

Another publisher with an impressive collection of R books is CRC Press. CRC Press published the textbook used in the R lab, in addition to Hadley Wickham’s Advanced R. CRC Press also appears to be the publisher of choice for Yihui Xie, the second-most-prolific package author, the creator of knitr, and a driving force behind the growth of literate programming in R (specifically, R Markdown).

Coding

While taking courses and reading books do help people learn R, I see these as developing a programming foundation. The only way to learn to program is to write code. My understanding of R did not take off until I actually had to use R on my own for both academic and “real-world” projects. The more you code, the more you encounter problems… and solve them, one way or another. Coding is a skill like any other. No one is born being good at math, despite popular myth. The same is true for coding, or any skill. It’s always painful starting out, but that’s true for any worthwhile skill; the only way to learn is to practice. And in this era of increasing mechanization, coding skills are becoming even more valuable. (And having R skills pays well.)

When you code, you learn what you need to learn. These range from basic skills (writing functions, looping, creating graphics) to the use of packages particularly useful for your application. You also develop programming style, which is how to write source code and documentation in a way that allows people to understand what is being done (yes, programming involves style, just like in English class). You learn how much commenting is too much (or how much is not enough).

If you are a student, use R for your projects. Otherwise, get an idea for a project and use R to complete it. Perhaps there is some data set you are curious about, or you want to develop a predictive model, or maybe even scrape a web page. Maybe think of the most arduous task in your job and think about how you could use R to pass off the work to a computer. Just find something to do, and do it. This will reinforce your skills.

Documentation

You should expect to encounter something new when coding that you have never done before. For example, you will likely use some package that I have not covered. In April, there were over 8,000 packages published on CRAN, and that number will only grow as R grows in popularity. The only way to make sense of any of them is to read their documentation.

Fortunately, I usually find the documentation for a package to be very helpful, and in RStudio, it’s extremely easy to pull up the documentation for any function you are unfamiliar with; just type ?newFunc or help("newFunc") in the console (where newFunc is the object you want to look up documentation for), and the documentation will be pulled up in a side window. If the documentation is especially good, it may include examples towards the end.
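As a short sketch of these lookups, using the base function mean as a stand-in for newFunc (the choice of function is just for illustration):

```r
# Look up the help page for a function by name.
# Typing ?mean at the console is shorthand for the same call.
h <- help("mean")

# The result is an object that displays the help page when printed
print(class(h))

# Show a function's arguments without opening the full help page
print(args(mean))

# Run the examples found at the end of a help page
example("mean", package = "base", ask = FALSE)
```

In an interactive RStudio session, printing h (or simply typing ?mean) opens the page in the Help pane.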

Package authors often don’t stop at just documenting functions. They may write vignettes that give more detail about the package’s purpose and common use, with examples or theory. A lot of packages are published with a journal article in the Journal of Statistical Software (J.Stat.Soft) or the R Journal, both of which are peer-reviewed journals dedicated to publishing about statistical software packages (the latter for R in particular, though most articles in J.Stat.Soft are about R packages). These articles often include a lot of information about the package. Some packages even have books devoted to them!

Some packages include demonstrations, or demos, that you can access to see common usage. To see all available demos, type demo() in the console; this will list all the demos in all packages loaded into the working environment. Then type demo("demo_name") to see the demo "demo_name". Try this out by typing demo("error.catching") in the console to see what a demo is like.
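A brief sketch of these calls, restricted to the demos that ship with the base package so that it does not depend on what else is installed:

```r
# List the demos available in one package; demo() with no arguments
# would list demos for every attached package instead
base_demos <- demo(package = "base")

# The listing is stored as a character matrix with one row per demo
print(base_demos$results[, c("Item", "Title"), drop = FALSE])

# Run a demo by name; guarded so that it only runs at an interactive console
if (interactive()) demo("error.catching")
```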

To find further documentation for a package, including vignettes and other information, try looking at the package’s page on CRAN; for example, here is the CRAN page for the package magrittr. On the CRAN page, you can find a basic description of the package, any vignettes, and the reference manual (which is a PDF file that holds essentially the same information as that found by looking up documentation from the command line, such as the usage of all functions included in the package). Third-party sites may also include documentation for particularly popular packages.
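Vignettes can also be listed from the console without visiting CRAN. The sketch below uses the grid package only because it ships with R; substituting "magrittr" should work the same way, assuming that package is installed:

```r
# List the vignettes that come with an installed package
v <- vignette(package = "grid")

# The listing is stored as a character matrix with one row per vignette
print(v$results[, c("Item", "Title"), drop = FALSE])

# Open an index of the package's vignettes in a web browser;
# guarded so that it only runs at an interactive console
if (interactive()) browseVignettes(package = "grid")
```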

Internet

Many people today, after getting a basic understanding of a programming language, forgo classes and books and rely on only two tools to learn how to program: Google and StackOverflow. The use of Google is obvious; if you don’t know how to use Google, you have bigger problems you must address before learning how to program with R. Jokes aside, though, usually you are not the first person to encounter a problem or need to accomplish a task, and a good Google search will help you solve a problem or direct you to the packages you need for a project.

Frequently, your Google search will direct you to StackOverflow, a website where programmers ask questions and other programmers answer them. Usually your question has already been asked, especially if you are new to programming, and you should thoroughly check to see if you can find the question already. In the rare case that your question is genuinely new, feel free to post it on StackOverflow. Most of the questions I have posted there have been answered, and there is a good chance your question will be answered as well.

Blogs

If you want to stay on top of R news, learn tips and tricks, discover new techniques and new packages, and just stay in touch with the extensive R community in general, look no further than R-Bloggers. R-Bloggers is a blog aggregator; bloggers request to have their blog added to the site, and whenever they publish a new R-related post, it gets copied and posted to the site, where it is then stored and distributed. You can follow R-Bloggers on Facebook and Twitter, via RSS, or via e-mail.

I get an e-mail from R-Bloggers daily, and sometimes it includes fascinating articles that expose me to something I had not known before. I am aware of dplyr, magrittr, parallel programming, Hadley Wickham, bookdown, the tidyverse, and many other things thanks to R-Bloggers. You can also stay on top of industry trends, which is always an important survival skill for anything remotely related to computer science. Bloggers are a great source of tutorials, as well; they will include source code with all of their nifty analyses, code that you can learn from to see best practices and techniques.

I also invite you to follow my blog. While I do post about non-R topics (Python programming, game programming, economics, and politics, and maybe something more personal now and then), I do try to post about data analysis and R programming, and I include source code whenever I can. I am also a contributor to R-Bloggers, so if you subscribe to R-Bloggers, you will be subscribing to my R content as well.

R User Groups and Mailing Lists

I am not a member of an R user group, nor have I attended any meetings. (I don’t own a car and I don’t have a lot of money, so I’m not very mobile; besides, I probably would rather spend my time doing something else.) That said, many R users like to attend R user group meetings and conferences to connect with the community and learn more about R. There is an R user group at the University of Utah that students here can join, and there are likely others nearby you can find. You can use this list to find a user group near you.

Some R users prefer the good-ol’-fashioned mailing list to stay on top of news and to communicate with other R users, perhaps to get help. CRAN has some official mailing lists you can consider subscribing to. The University of Utah Mathematics Department has its own mailing list as well, to which I am a subscriber. Sadly, few use this mailing list, but I hope that more students join the list so that they can keep in contact with one another, not only to share news and ask for help but also to build a professional community through which connections can be made (and perhaps job tips shared).

I also would like to mention that the website R-users is a site where employers can seek out R programmers (there is also a feed for R-users on this blog). In future job searches, perhaps check there for R-related jobs.

Giving Back to the Community

R is open source software, and in that spirit, many contribute to R and its development. The language itself is free (in terms of both speech and beer), and the overwhelming majority of packages useful for application are free. Documentation is free, and thanks to websites such as StackOverflow, users can even get high quality assistance for free. Many learning resources are provided for free. While you can take advantage of how cheaply R can be used ($0 is really cheap), I hope you consider how you could possibly “pay back” the community that provides all of these excellent resources for you. (And while this could take the form of much-needed monetary support, that is not what I have in mind.)

Granted, this article is written for those with minimal R experience, who are thus unlikely to believe they have anything valuable to contribute. However, even R beginners can “pay back” the community, and if you use R enough in your career, you will eventually develop some level of expertise that can help others who are starting out just like you. In fact, another “beginner” may be more helpful to others than another “expert”, who may take some knowledge for granted.

Also keep in mind that when I say you should “give back” to the community, this is not only for the sake of the community; it is for your sake as well, and you yourself may benefit the most from your contributions.

First, helping others and sharing your code improves your own skills. One of the reasons why I am a good R programmer is that I teach the R lab. The process of preparing lectures and helping students on assignments hones my own skills and forces me to address issues I otherwise would take for granted. On top of this, you learn to write better code and functions when that code is being written with a user other than yourself in mind. Your style and documentation improve, and the functions you write will likely be more useful when you need to think about how to make them “general”, or usable in a wide variety of situations.

Second, by contributing to the community, you begin to develop a reputation that can translate to a more successful career. You can become an authority on a topic that others look to for help and thus develop a personal brand. If the resources you provide (be it a package, a blog post, or even answers to StackOverflow questions) become popular, you can develop an audience that you can then exploit for profit or leverage in job interviews. Employers can see your writing and code samples, and if those samples are good, you may be more likely to get the job you want.

So with that in mind, here are some ways to “give back”:

GitHub

Many programmers have accounts on GitHub, a software repository site that also serves as a social network for programmers. On GitHub, developers store and share their code, and others can download it, report issues, or even submit a pull request, which proposes a developer’s own modification to the code for the owner to review and merge. By hosting code on GitHub, you may get more eyes looking at and using your code.

In your line of work, you may find yourself writing functions useful to the applications you are working on. If these functions are generally useful and not already in existence, you may want to consider bundling your code together into a package, which you can then host on GitHub or perhaps consider submitting to CRAN. People do benefit from being the authors of popular, useful packages. Ari Lamstein, the author of the mapping package choroplethr, reported that his package has made him money in books and consulting.

Even if you don’t have a package for some particular application in mind, perhaps consider organizing functions you use commonly into a personal package for your own use, and sharing this personal library on GitHub (with the disclaimer that you may change the contents of the package at whim and others should use the package at their own risk). This may encourage you to write better functions and classes and to document your work well, all while keeping it together in a single place.
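As an illustration of what a function written with other users in mind might look like, here is a minimal sketch. The function name sample_prop and its arguments are hypothetical, and the comment header assumes the roxygen2 documentation convention, one common choice for R packages:

```r
#' Sample proportion of observations equal to a given value
#'
#' @param x A vector of observations.
#' @param value A single value whose proportion in x is wanted.
#' @param na.rm Should missing values be dropped first? Defaults to FALSE.
#' @return A single number: the share of entries of x equal to value.
sample_prop <- function(x, value, na.rm = FALSE) {
  # Fail early with a clear error if the caller passes several values
  stopifnot(length(value) == 1)

  # The mean of a logical vector is the proportion of TRUE entries
  mean(x == value, na.rm = na.rm)
}

# Example usage with made-up data
print(sample_prop(c("USA", "Japan", "USA", "Germany"), "USA"))
```

Taking the value and the missing-data handling as arguments, rather than hard-coding them, is what makes the function usable beyond the one data set it was first written for.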

Additionally, if you find bugs or weak features/functionality in others’ packages, don’t hesitate to report them. Perhaps you can even read the source code yourself and add the fix; the authors and the community would certainly appreciate the help. (But be sure you know what you are doing.) This holds for packages hosted on CRAN as well, though the process for GitHub packages is simpler.

StackOverflow

It doesn’t take much to ask questions on StackOverflow, while answering them takes more effort. Answering questions, though, may benefit your own career. The website does a good job of tracking users’ contributions and scoring their usefulness, and this may make for a nice line on a resume. Selfish motivations aside, though, I believe that if you are going to ask questions, you should try to answer others’ questions when you can, since others have donated their time to help you (and it’s not easy or quick to write good answers).

Writing/Blogging

In my career, I have found that nothing teaches me more than being forced to write about a topic. Yihui Xie has written a few books on R topics (usually related to his packages), and in the announcement blog post on RStudio’s website, he had the following to say about writing books:

Writing books can be highly addictive: it helps you organize your (random) thoughts and content into chapters and sections, and it is very rewarding to see the number of pages grow each day like a little baby. You can do things that you normally cannot/won’t do in journal papers. … Choose a fresh and crispy font, and you simply cannot stop writing!

I too have found that writing forces me to process topics more deeply than when I don’t write, and I learn a lot just from the act of writing. (Again, this is one reason why I like teaching.)

You are never too inexperienced to write a blog. In fact, beginners are often great authors since they take less for granted and can explain things very clearly once they gain an understanding of a process. They also are better at demonstrating the process that leads to a result. Other beginners would greatly value your contributions.

It does not take too much to develop a modest audience, and a few great blog posts can earn you a reasonable baseline readership. I don’t promote my blog beyond announcing new posts on Reddit and other social networking sites, and having my posts distributed by R-Bloggers. While I have not had any of my R posts “take off”, my series on finance data using Python does earn my blog at least a few hundred views daily and is a popular introductory guide to the topic; considering my goals, that’s not bad. I get e-mails from readers on a regular basis asking for insight or proposing opportunities. That said, don’t be disappointed if your blog does not have a huge following (unless you’re trying to make a living blogging); those who found your insights useful will appreciate your efforts, and again, the one who learned the most from your article is you.

As you develop more expertise and an audience, you may eventually want to consider writing a book, as Yihui Xie suggests. Writing book-length material is not as difficult as you may think, especially when you have a book’s worth of material to write about. Even if you don’t sell your book, it can still be a useful part of your portfolio.

Conclusion

As I said earlier, this blog post represents my own experience with learning R, and it also reflects my own learning style. I also am not that experienced with R; I’ve been using it since 2012, while others have likely used it much more than I have, and I would invite their thoughts in the comments. That said, I believe that the suggestions made here may help not only those who want to learn R but also those who want to learn programming in general.

If I could offer one final tip, though, it would be to never grow complacent. The world is a rapidly changing place, and that’s especially true in the industries where R is commonly used. You should be prepared to always be learning in order to not just stay on top, but stay employed! Being able to learn independently may be the most valuable skill to have in the machine age, so I suggest you start practicing now.


