2016-12-27

In last week’s blog post, I asked: “How much data science do you actually remember?”

It’s a critical question. If you study data science, but forget everything that you learn, you’ll be in big trouble when you go in for an interview. Or, you’ll be in big trouble if you actually get a data science job, but you’ve forgotten the essential skills.

Let me be very clear: you need to know your essential toolkit inside and out. You need to remember your tools, and you need to be able to execute quickly and on command if you want to be a top performer.

However, the hard truth that you actually need to “memorize some syntax” stirred up several comments.

One comment stood out because it raised a critical question: why memorize your toolkit if tools become obsolete?

Every tool has a shelf life, so why master your tools?

Here’s the comment in question:

Thanks for posting this article. I enjoyed reading it, and I found it thought-provoking. I actually disagree in some ways. Yes, you need to be fluent in data analysis to the point where you know what your strategy is going to be and what tools you will need to execute on it. But I don’t think it’s a good use of time to memorize, eg, the exact syntax of anything but your most bread-and-butter tools.

As someone who slogged through a quantitative PhD after years in data engineering, this might sound like blasphemy. But my experience has been that every tool has a shelf life. Every single one. We all used Perl in the early 2000s. A couple years ago, dplyr was nowhere to be found. Even just a few months ago, you used lapply where today you might use purrr.

Any package you are using, no matter how essential or basic it seems now, is going to get replaced with something easier, more elegant, higher-level. Even the way you think and talk about data analysis is going to evolve. What is not going to change is the need for a cold, clear-eyed way of looking at a problem and building a game plan on the fly.

I actually agree with this comment in a few ways (which I’ll address momentarily) but let’s first understand the essence of his objection.

The heart of his critique is this: data science is changing very fast, and any tool that you learn will eventually become obsolete.

This is absolutely true.

Every tool has a shelf life.

Every. single. one.

Moreover, it’s possible that tools are going to become obsolete more rapidly than in the past. We can’t be certain, but if we have entered a period of rapid technological change, it seems plausible that toolset changes will become more frequent.

So, I agree with David’s comment: tools become obsolete. Although R is very popular today (and has been growing in popularity over the last few years), another language might become more popular for data science. We don’t know.

What I do know is that if you want to learn data science today, you need to select a tool and master the basics.

With that in mind, I want to clarify a few points to make sure that you understand exactly what you need to do as you get started learning (and mastering) data science.

Mastering the foundations doesn’t take long (if you know how to practice)

As a beginner, you need to master the foundations.

Keep in mind that I’m not telling you to spend the next 3 years memorizing every part of the R programming language.

But, you need to know the fundamentals backwards and forwards.

A little more specifically, you need to master the foundational skills of at least two core skill areas: data visualization and data manipulation. You also need to be able to use these tools together to analyze data.

As an R user, that means you should know the most common tools from base R, and the most commonly used techniques from a few packages:

– ggplot2

– dplyr

– tidyr

– readr

Keep in mind that these are general recommendations that will apply to about 80% of people. If you’re in a specialized industry, the advice may be slightly different. For most people, however, these are the tools that you need to know.
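To make “using these tools together” concrete, here is a minimal sketch of readr, tidyr, and dplyr in one short pipeline. The inline CSV and its column names are hypothetical, invented only for illustration. The sketch also assumes current package versions (readr 2.0 or later for I(), tidyr 1.0 or later for pivot_longer(), dplyr 1.0 or later for .groups), which are newer than what was available when this post was written.

```r
library(readr)
library(tidyr)
library(dplyr)

# Hypothetical inline CSV, standing in for a file on disk
csv_text <- "region,q1,q2
north,10,14
south,7,9
north,3,5
"

# readr: parse the CSV (I() marks the string as literal data, not a path)
sales <- read_csv(I(csv_text), show_col_types = FALSE)

# tidyr: reshape from wide (one column per quarter) to long
sales_long <- pivot_longer(
  sales,
  cols = c(q1, q2),
  names_to = "quarter",
  values_to = "amount"
)

# dplyr: total amount per region, largest first
totals <- sales_long %>%
  group_by(region) %>%
  summarise(total = sum(amount), .groups = "drop") %>%
  arrange(desc(total))

print(totals)
```

None of these functions is exotic. They are the kind of high-frequency tools that are worth knowing from memory.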

It is absolutely possible to master these foundations within 2 or 3 months (maybe faster if you put in more time). The secret is systematic practice. If you know how to practice, you can master R’s essential toolkit very, very quickly.

If it only takes a couple of months (and has huge payoffs), why wouldn’t you master the essential toolkit?

To clarify: you need to learn the high frequency tools

Pay careful attention to exactly what I just wrote:

I said that you need to master the “essential toolkit.” Master the “most commonly used” techniques from the packages I recommend.

Do you need to learn every single function from those packages?

No.

Do you need to learn all of the parameters of every function?

No.

Do you need to memorize every little detail?

No.

… but, you need to memorize the most commonly used tools.

Here’s a quick example:

To be productive as a data scientist, you need to know at least a handful of essential data visualizations. You’ll use these over and over again in reporting, analysis, visual communication, and exploratory data analysis (e.g., using EDA as a step in your machine learning workflow).

You need to know:

– The histogram

– The scatterplot

– The bar chart

– The line chart

Learning these tools is like learning the basic vocabulary of a foreign language. They are essential. They are your “essential data visualization vocabulary.”

They’re essential because you’ll use them constantly. They are the “high frequency” data visualizations that you’ll use vastly more often than other, more obscure visualization techniques.

Because they are so common and so essential, you need to know them backwards and forwards. If I were to tell you to “create a scatterplot of X vs Y using ggplot2,” you should be able to do that on command, from memory.
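For reference, here is roughly what that looks like for the scatterplot and the other three essential charts. This is a sketch, not the only way to write them. The datasets (mtcars from base R and economics from ggplot2) and the variable choices are my own picks for illustration.

```r
library(ggplot2)

# Scatterplot: car weight vs. fuel economy
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Histogram: distribution of fuel economy, split into 10 bins
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Bar chart: number of cars for each cylinder count
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Line chart: US unemployment over time (economics ships with ggplot2)
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

# Print any of the objects to draw it, e.g. print(p_scatter)
```

Notice how little changes from one chart to the next: the data, the aesthetic mapping, and the geom. That small, repeated pattern is exactly what you want to be able to type without looking anything up.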

Trying to analyze a dataset without fluency in these techniques is like trying to speak French without knowing any French vocabulary. It won’t work.

I can’t emphasize this enough: there are foundational skills that you need to know. They are the essentials.

You need the foundations to be productive and get hired

Now that I’ve clarified that you need to master the foundations (but not necessarily every syntactical detail) let’s talk more about the core point of David’s comment: tools become obsolete.

This is true. In a few years, another data science language might overtake R as the top-of-the-line data language. We just don’t know.

What I do know is that if you haven’t mastered the foundations of a data science language today, you won’t be productive today.

Quite frankly, even though it’s absolutely true that tools become obsolete, that doesn’t change the fact that you need to be skilled in a toolset today in order to get hired and be productive today.

If you want to get a job as a data scientist, you need to have a high level of competence in data visualization and data manipulation (at minimum).

And getting hired isn’t the only reason to master them. You also need a high level of competence in these skill areas to “get things done” once you get a job. If you don’t know them, you will be unproductive.

The point is, even if R might become obsolete at some point in the future, you still need to get hired and get things done today. There’s no way around it. You need to know your tools.

You can use the foundations as a platform to learn higher-order concepts

Even if R becomes obsolete in the long run, mastering the syntax of essential data science tools in R has another advantage: it serves as a foundation for learning higher-order, language-agnostic concepts and processes.

Ultimately, after you learn the syntax of the essential tools, your next step should be to learn these higher-order concepts and processes.

Let me give you an example. I just mentioned that you need to know several essential data visualizations: the scatter, the line, the bar, and the histogram. You should know the syntax for these cold. But you also need to know how to apply them.

I’ll re-emphasize that you can’t apply a tool that you don’t know. If you can’t create a scatterplot, you can’t apply it as a tool to analyze data. So once again, mastering the syntax of your foundational tools is necessary.

Having said that, once you really know the syntax for creating these basic plots, you need to know how to use them as analytical tools. When do you use the scatterplot? What is a bar chart good for? What are the limitations of the histogram? What do you do if you encounter overplotting in a scatterplot? How can you combine techniques to create effective multivariate visualizations? These are some of the things you need to learn after learning basic syntax. You need to understand concepts, processes, and application.
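Take the overplotting question as an example. Below is a minimal sketch of two common responses, using the diamonds dataset that ships with ggplot2. The alpha value and bin count are arbitrary choices for illustration, not recommended settings.

```r
library(ggplot2)

# Plain scatterplot: roughly 54,000 points pile up on top of each other
p_raw <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

# Option 1: semi-transparent points, so dense regions appear darker
p_alpha <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.05)

# Option 2: bin the plane into rectangles and map the count to fill color
p_bin <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d(bins = 50)
```

The syntax differs by only a line, but knowing when to reach for each option is the part that carries over to any other plotting tool.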

This knowledge about process and application is largely language-agnostic. Once you learn how to use visualizations to find insights in R, you’ll be able to use those visualizations to find insights in another programming language. The knowledge about how to use visualization to find insights is a transferable skill.

There are other language-agnostic skills that you must learn after learning essential syntax. For example, you need to learn basic data visualization workflow; that is, you need to know how to iteratively create a data visualization, starting with a simple version, and then adding details to create a more “polished” chart.
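Here is one way that iterative workflow might look in ggplot2. It is a sketch only: the dataset (mtcars) and the title, labels, and theme are my own illustrative choices.

```r
library(ggplot2)

# Step 1: the simplest version that shows the relationship
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Step 2: add a third variable by mapping cylinder count to color
p2 <- p1 + aes(color = factor(cyl))

# Step 3: polish with a title, readable labels, and a cleaner theme
p3 <- p2 +
  labs(
    title = "Heavier cars get fewer miles per gallon",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders"
  ) +
  theme_minimal()
```

Each step builds on the previous object, so you can print p1, p2, and p3 in turn and check the chart at every stage before adding more.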

I also recommend that you use your foundation in R essentials to learn mathematical and statistical concepts. I’ve written about this elsewhere, but I think that beginners place too much emphasis on abstract math in the beginning. My common recommendation is to learn how to analyze data first (using ggplot2 + dplyr). After that, you might even learn a little bit of practical machine learning (not theoretical ML). Then, once you’ve learned some syntax and you’ve learned how to apply basic tools, you can “back into” higher-order mathematical concepts.

Again, all of these higher-order skills cut across programming languages. How to think about data visualization is not R-specific. How to think about data analysis is not R-specific. These are meta-skills that are separate from the programming language itself.

What that means is that when you begin learning “how to think about data visualization” and “how to think about data analysis” in R, you’ll later be able to take what you learn and apply it if you move on to another programming language.

Learning a programming language can help you “learn how to learn”

And as David’s comment pointed out, it’s very likely that you will need to learn a new programming language within a few years.

The tech world is changing very fast. Any language that you learn today is likely to become obsolete. You’ll eventually need to learn another language and toolkit.

What this means is that if you can rapidly master new programming languages, you’ll be able to stay at the cutting edge. This will make you very valuable. I actually think that being skilled at learning itself will be one of the most valuable skills of the next couple of decades.

“The illiterate of the 21st century will not be those who cannot read and write, but those who cannot learn, unlearn, and relearn.”

– Alvin Toffler

I said it in my post last week, and I’ll say it again:

You need to learn how to learn.

If you want to be an elite performer in the next few decades, you need to learn how to learn.

And to be clear: learning is a technical skill. You can get better at learning.

And actually, learning a new programming language is an excellent way to get better at the skill of learning (if you’re highly systematic).

Do you want to be a highly valued “superlearner”?

Learn R, and be extremely systematic about how you do it.

To understand how learning a programming language can help you improve at the metaskill of learning, let’s examine a few people who have mastered “how to learn” spoken languages. (Learning a programming language is very similar to learning a spoken language.)

“Polyglots” offer some clues on how to become more effective at learning.

What you can learn from foreign language “polyglots”

Many noteworthy polyglots are highly systematic in how they learn. They aren’t geniuses by birth, they’ve just learned how to learn. They are highly skilled learners.

Many of them have turned learning a new language into a process.

Here’s a look at several things they do:

You’ll learn to focus on foundations

Several famous polyglots and superlearners insist that you need to focus on foundations first. (Does that sound familiar? I’ve been hammering that point for months.)

For example, Tim Ferriss attacks a new language by identifying the highest frequency words in that language. This is an application of the “80/20” rule. He finds word frequency lists, and identifies the most frequently used words that yield the highest return on investment. He finds those words and memorizes them. He practices those high-frequency words over and over (commonly with flash cards).

This is very similar to the system suggested by the polyglot Gabriel Wyner in his terrific book Fluent Forever. His system is essentially as follows: identify the highest frequency words, and practice those words until you know them cold. Essentially, he’s applying a principle similar to Ferriss’ common 80/20 analysis. Find the most frequent words and learn those first. (I’m talking about words like “cat”, “dog”, “car”, “sit”, “walk”, “eat”.)

There’s a reason why this works. In most spoken languages, a vocabulary of the most frequent 1000 words “covers” about 75% of the spoken language. That is, if you know the most frequent 1000 words, you’ll be able to understand about 75% of the language.

So, these superlearners understand that to rapidly learn a spoken language, they need to identify the highest frequency words and master them.

We can use a similar principle when we learn R: find the most used, most important techniques, and master them.
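You can even run this kind of frequency analysis on your own code with a few lines of base R. This is only a sketch: the calls vector below is a made-up stand-in for function names you might extract from your own scripts, not real usage data.

```r
# Hypothetical log of function calls pulled from a handful of scripts
calls <- c("ggplot", "filter", "mutate", "ggplot", "filter", "select",
           "ggplot", "mutate", "filter", "ggplot", "spread", "separate")

# Count each function and sort from most to least frequent
freq <- sort(table(calls), decreasing = TRUE)

# Cumulative share of all calls covered by the top-k functions
coverage <- cumsum(as.vector(freq)) / length(calls)
names(coverage) <- names(freq)

print(round(coverage, 2))
```

In this toy example, the three most frequent functions account for 9 of the 12 calls. A coverage curve like this tells you which techniques to drill first.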

You’ll learn how to practice

I’ve said it recently, and I’ll repeat it: if you want to rapidly master R, you need to practice.

Polyglots know this too. To master a spoken language quickly, you need to practice. Most polyglots start out by drilling words until they’ve learned basic vocabulary. Then they move on to practicing longer phrases.

Later, they move on to application and working “in the real world” by having conversations.

If you want to master R quickly, you need to practice. And you need to develop systems for practicing effectively.

Again, these “systems” for practicing a programming language are transferable. After you have a system for learning R, you can then apply a similar system to learning a new language in the future.

You’ll get better at learning programming languages

Ultimately, once they develop systems and “get better at learning,” many polyglots will tell you that learning new languages becomes easier over time. Learning a second foreign language is easier than learning a first foreign language. The one after that is easier still.

As a data science student who’s learning R, the takeaway here is that if you can become systematic in how you learn R, you can become better at learning programming languages.

So if R does become obsolete, you’ll be prepared.

TL;DR: Here’s my recommendation

Technology is changing fast.

Any programming language you learn does have a shelf life.

But don’t use that as a reason to not master the foundations of R.

Instead do the following:

Master the foundations of R

This means master the essential tools of data visualization, data manipulation, and data analysis in R. Drill the syntax of these foundations until you know them with your eyes closed.

Use your language as a platform to learn principles

Once you’ve nailed the syntax, use your language as a “platform” to learn principles. Begin to focus on “how to analyze data”, “how to think about data visualization”, “how to find insights.” Essentially, you want to begin learning concepts, workflow, and process. These skills are language-agnostic, so you can bring them with you if you move to another language.

Master the art of learning

Tools become obsolete. Over the course of your career, you’ll have to learn new things to stay competitive.

This is a reason to master the art of learning itself. Become systematic about how you learn. When you learn, focus on foundations, work on small problems first, then increase complexity and apply your skills on increasingly hard problems.

Ultimately, I’m telling you not to despair that programming languages become obsolete.

I want you to become so good at learning them that you just don’t care.

Discover how to master R and master the art of learning

Do you want to master the foundations of R?

Do you want to master data science?

Do you want to master the art of learning programming languages?

Sign up for the Sharp Sight email list.

At Sharp Sight, we’ll not only show you how to master R, but also show you strategies for learning fast.

The post Why you should master R (even if it might eventually become obsolete) appeared first on SHARP SIGHT LABS.
