2017-01-18

For a couple of years, I’ve been writing about the importance of data analysis, saying that data analysis is essentially the foundation of data science itself.

While I admit that there’s room for argument (and that the reality is a little more nuanced), I still firmly believe that data analysis is the true foundation of practical data science.

That is, if you want to learn practical data science (as opposed to theoretical, academic topics) you need to be able to absolutely shred a dataset.

If we drill into that just a little more, what that means is that you need to be able to take a dataset, wrangle it into shape, and use visualization and other techniques to find insights.

This is essentially why I’ve been consistently recommending that you start learning data science by getting a solid foundation in two core skill areas:

Data wrangling (AKA, data manipulation)

Data visualization

Once you learn these skills independently, you can use them together for analysis.

In practical terms, this means that as an R user, you should know ggplot2 and dplyr at a minimum. I also recommend learning a few other parts of the tidyverse, like tidyr and lubridate.

What does an analysis look like in R?

I’ve been emphasizing the importance of visualization, wrangling, and analysis for a while, but you might be wondering, “what does it look like, in practice?”

I’m glad you asked.

As it turns out, last week I started asking questions about economic shifts in the world. In particular, I was asking questions about shifts in geopolitical and geoeconomic centers of power, and the rise of Asia. After some clicking around, I stumbled on a Wikipedia dataset about shipping volumes at the “busiest ports in the world.”

Aaaand, down the rabbit hole I went.

I decided to analyze the dataset, partially to satisfy my curiosity, but also to give you an example of what’s possible once you know a few critical skills of the tidyverse.

A quick note for beginners

If you’re a beginner, you should know that the code you’re about to see is “intermediate level” code. At its core, the code is pretty straightforward (really, I swear), but there are lots of little details that obscure the underlying simplicity.

Having said that, in an upcoming post I’m going to strip all of this down to the bare bones, to show you that you can learn 80% of this very quickly. If you want to be notified when that post is published, sign up for the Sharp Sight email list.

In the meantime, let’s dive in. I want you to see just how powerful the tidyverse is for wrangling, visualizing, and analyzing data.

First things first: we need to get the data and hammer it into shape.



Get the data

First, we need to get the data. To do this, we’re going to scrape the data from Wikipedia using the rvest package.

Admittedly, web scraping is not a topic for beginners. If you’re a beginner, you could just copy and paste the data into Excel, save it as a csv, and import it using the readr package.

However, I think this is a more interesting way to get the data (and it’s something that can be easily communicated/replicated through a blog post).

Ok, we’ll use read_html(), html_nodes(), and html_table() from rvest to get the data and parse it into a dataframe.

(As I noted, this is sort of an intermediate topic, so I’ll do a more in-depth tutorial on scraping in the future.)
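To make this concrete, here’s a minimal sketch of the scraping step. The URL and the table’s position on the page are assumptions on my part (Wikipedia pages change over time), so you may need to adjust the node selection.

# load packages (rvest for scraping, the tidyverse for wrangling)
library(tidyverse)
library(rvest)

# scrape the Wikipedia page
# (URL is an assumption; substitute the page you're actually using)
html.world_ports <- read_html("https://en.wikipedia.org/wiki/List_of_busiest_container_ports")

# parse the tables on the page and extract the rankings table
# (assuming the rankings are the first table; adjust the index if not)
df.world_ports <- html.world_ports %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) %>%
  .[[1]]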

Inspect the data

After scraping and reading in the data, we’ll examine the data using glimpse().

This will give us the names of the columns, the dimensions of the data, and will print out the first several observations.

You’ll see this sort of data inspection after many of the procedures that we execute. It’s important to routinely examine your data after you change it. Essentially, as you execute a procedure, you want to make sure that the procedure did what you wanted it to. You want to make sure that everything looks “OK.”

Get into the habit of examining your data after important blocks of data wrangling code.
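In code, the inspection is a one-liner (assuming the scraped data frame is named df.world_ports, as in the sketch above):

# inspect column names, dimensions, and the first several observations
glimpse(df.world_ports)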

Notice that we have the following variables:

Rank: lists the shipping-volume rank for 2014

Port: the port name (i.e., the city/location of the port)

Economy: the country where the port is located

2014[], 2013[], etc: These year columns contain the shipping volume for that given year

Again, the years themselves are spread across columns (as column names), and the shipping volumes (which we’ll want to eventually analyze) are listed under those year columns. This is not the correct, ‘tidy’ structure of the data, so we’ll eventually need to reshape it.

Also notice that almost everything has been read in as character data. We’ll need to change several of these variables to factors, numerics, etc.

Rename variables

Next, we’re going to rename our variables by transforming them all to lower case.

By default, when we read in the data, several of the variables have capitalized first letters.

In programming, there are several different style conventions (and I’ll admit that my naming conventions are occasionally unconventional), but many people will tell you that there’s no good reason to have a capitalized first letter for a variable name. In most situations, you’ll want to remove leading capitals, if for no other reason than it makes them easier to type.

To execute this, we’re going to use colnames() and tolower().

On the right-hand side of the expression, colnames() lists the column names and tolower() transforms them to lower case. Here, we’re using the pipe operator (%>%) to “pipe” the names into tolower().

We then use the assignment operator (<-) so the new lower-case names serve as the input of colnames(df.world_ports) on the left-hand side. Remember that colnames() can both retrieve column names and set column names.

So ultimately, on the right-hand side of the expression we’re retrieving and transforming to lower case. Then on the left-hand side of the expression we’re re-setting the column names.
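Putting that together, the renaming step looks like this:

# retrieve the column names, transform them to lower case, and re-set them
colnames(df.world_ports) <- colnames(df.world_ports) %>% tolower()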

Get geo-spatial information (latitude/longitude)

Next, we’re going to “geocode” our observations so we can use a few geo-spatial visualization techniques later in our analysis. That is, we’re going to get the latitude and longitude associated with our observations.

More specifically, each row of our data frame (as it currently stands) represents a port at a specific location. Most frequently, the port is identified simply by a city name, but sometimes there’s a specific port name.

We’re going to use geocode() from the ggmap package to retrieve the latitude and longitude of these locations.

Working with geo-spatial data is a slightly more advanced topic, but geocode() is fairly straightforward. You give it a place name, and it returns lat/lon coordinates. In this case, we’re going to feed it the names of the ports contained in the variable df.world_ports$port.

We then need to merge the lat/lon data back onto df.world_ports. To do this, we’ll use cbind().
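Here’s a sketch of the geocoding step. Note that geocode() queries an external geocoding service, so it needs an internet connection (and, in recent versions of ggmap, a registered API key); the name geocodes.world_ports is just my placeholder.

library(ggmap)

# retrieve lat/lon coordinates for each port name
# (queries an external service; recent ggmap versions require an API key)
geocodes.world_ports <- geocode(as.character(df.world_ports$port))

# merge the coordinates back onto the main data frame
df.world_ports <- cbind(df.world_ports, geocodes.world_ports)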

Manually code missing latitude/longitude data

When we used geocode(), it correctly retrieved the lat/lon data for most of our observations.

But the function failed for a few of the ports. Essentially, geocode() couldn’t find those locations (or possibly found several matching locations and didn’t know which to select), so for those observations we’re missing the lat/lon data.

Consequently, we need to get that data another way.

To get the lat/lon data, we’re going to manually retrieve this information at http://www.latlong.net/convert-address-to-lat-long.html. If you go there, you’ll find that you can simply click on a location on the map, and it will return the latitude and longitude.

After manually obtaining the remaining lat/lon coordinates (they are recorded in the commented code below), we’ll just hand-code them into our data frame.

This is probably a slightly inelegant solution, but it will work for the time being (and it’s simple enough that I can communicate it to you in a blog post).

To be honest, this block of code may be a little challenging to understand because of how we’re referencing the data frame inside the case_when() statements.

Having said that, at its core, this code block is just using dplyr::mutate() to modify our lat and lon variables. We’re then using case_when() to code those variables for specific ports. So, for example, when the port name is “Valencia,” set the lat variable to 39.46991. Etcetera. That’s all this is doing.
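Here’s a hedged sketch of that hand-coding step. Valencia is the one example discussed above; its coordinates (and those of the other failed ports) come from the manual lookup, so treat the numbers below as placeholders.

# hand-code lat/lon for the ports that geocode() missed
# (Valencia shown as the example; coordinates are from the manual lookup)
df.world_ports <- df.world_ports %>%
  mutate(
    lat = case_when(
      .$port == "Valencia" ~ 39.46991,
      TRUE ~ .$lat
    ),
    lon = case_when(
      .$port == "Valencia" ~ -0.3763,
      TRUE ~ .$lon
    )
  )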

Convert variables to factors

There are several variables that were parsed as character variables that we’d rather have as factors.

This being the case, we’re going to convert these variables to factors. Note that we’re ultimately using the dplyr verb mutate() to access the data frame and modify the variables. You’ll probably recall from past tutorials that mutate() allows us to “mutate” or “change” variables contained inside of a data frame. Again, in this case, we’re using it to modify our data frame and convert economy and port into factors.
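In code, the conversion looks like this:

# convert economy and port from character to factor
df.world_ports <- df.world_ports %>%
  mutate(
    economy = as.factor(economy),
    port = as.factor(port)
  )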

Create a ‘label’ variable

Next, we’re going to create an extra variable that we’ll essentially use only for labels when we create plots.

The reason for this is that several of the port names are simply too long for some of our plots. When we begin to visualize the data (in the next blog post), you’ll see that some of the port names are so long that they would get cut off if we used them.

That being the case, we’re going to create a new variable which will still contain the port names, but will use abbreviated names for the long-named locations.

To create this, we’re simply going to identify the names that are excessively long and create shortened names.

To identify the port names that are too long, we’re going to list them out using levels(df.world_ports$port). When we do this, we can visually identify names that are long.

After identifying long names, we’re using fct_recode() to recode those values into shorter versions. Notice that we’re actually creating a new variable here, named port_label. This is because we want to keep the full names intact and use them when we can; there is more information in the full names, which we don’t want to lose. So in this case, we’re just making a new secondary variable to use in special cases.

Note as well that all of this is occurring inside of dplyr::mutate(). As I already noted, mutate() allows us to modify a data frame and create (or change) variables.
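Here’s a sketch of the recoding step. The two recodes shown are illustrative; the real list comes from visually scanning the output of levels(df.world_ports$port).

library(forcats)

# list the port names so we can spot the long ones
levels(df.world_ports$port)

# create port_label, with shortened names for the long entries
# (these two recodes are illustrative examples, not the full list)
df.world_ports <- df.world_ports %>%
  mutate(port_label = fct_recode(port,
    "Saigon" = "Ho Chi Minh City (Saigon)",
    "Bremen" = "Bremen/Bremerhaven"
  ))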

Keep in mind that this process was actually iterative. I had to recode these names and then use them in test-plots (like bar charts and small multiple plots) to see if the names were truly short enough, and to make sure that I got them all.

I’ve emphasized the importance of iterative workflow in the past, so I want to point out that it was actively used here, even if you can’t directly see that process in the finalized code.

Reshape data into long format (AKA ‘tidy format’)

Now, we’re going to put the data into “tidy” form.

What is tidy form?

It’s best explained with an example. If you print out the data using the head() function, you’ll notice that each year is actually in its own column. This is ‘untidy.’

The ‘tidy’ form of the data would have the years in a ‘year’ variable.

Essentially, this means that we need to reshape the data into a new format, transposing the column names (‘2004[11]’, ‘2005[10]’, etc.) into values of a single new column.

As we do this, we will also be creating a new variable called ‘volume.’ This will contain the shipping volumes that had been contained underneath our old year columns (i.e., 2004[], etc.).

To do this, we need to use gather() from the tidyr package.

(Note that if data reshaping seems a little complicated, don’t worry. tidyr can be a little hard to wrap your mind around. I’ll write a separate tutorial about it in the future.)
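Here’s a sketch of the reshaping step. Because the year columns carry bracketed footnote markers in their names, it’s easiest to gather everything except the columns we know aren’t years:

# gather the year columns into a 'year' key column and a 'volume' value column
df.world_ports <- df.world_ports %>%
  gather(year, volume, -rank, -port, -economy, -port_label, -lat, -lon)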

Notice again that after performing this major transformation of the data, we’re inspecting the data. Specifically, we’re using head() to print out a few rows, and then we’re using levels() several times to inspect the levels of our factor variables. In particular, you want to look at the output of levels(as.factor(df.world_ports$year)) to see whether the values of our new ‘year’ variable are actually the names of the years. (They are.)
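The inspection code itself is short:

# inspect the reshaped data
head(df.world_ports)
levels(df.world_ports$economy)
levels(as.factor(df.world_ports$year))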

Convert year to factor

When we reshaped our data and subsequently created the year variable, year was actually created as a character variable.

We need to convert it to a factor variable. This is a straightforward use of as.factor() inside of mutate().
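In code:

# convert year from character to factor
df.world_ports <- df.world_ports %>%
  mutate(year = as.factor(year))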

Rename the levels of year

Now that we have our years in a single column (instead of spread across multiple columns), we’re going to recode the values of those years.

This is because when we scraped the data from Wikipedia and parsed it into our dataframe, it retained some artifacts that we don’t want.

Specifically, there are numbers inside of brackets that we want to remove. For example, if you examine the levels of year, you’ll notice the value 2004[11]. We want to remove the [11] from this value, along with the similar suffixes on the other year values.

To do this, we’ll again use fct_recode() inside of mutate().

There are more elegant ways to do this, but this is a simple method that’s relatively easy to comprehend.
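Here’s a sketch of the recoding. The first two mappings use the values we’ve already seen (‘2004[11]’ and ‘2005[10]’); the remaining suffixes need to be read off of levels(df.world_ports$year).

# strip the bracketed footnote markers from the year levels
# (add one mapping per remaining year, read from levels())
df.world_ports <- df.world_ports %>%
  mutate(year = fct_recode(year,
    "2004" = "2004[11]",
    "2005" = "2005[10]"
  ))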

Change ‘volume’ variable to numeric

Now we need to modify the volume variable that we created when we reshaped our data using gather().

When we initially scraped the data, the shipping volume data was read in as character data because it contained commas. But later, to analyze these data, we need them to be numerics.

To convert from character to numeric, we first need to strip out the commas.

So, we’re going to use str_replace() from the stringr package to strip out the commas, and then cast the result as a numeric using as.numeric().

Once again, we’re executing this inside of mutate(), because mutate() allows us to modify variables contained inside of a data frame.
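In code, the conversion looks roughly like this. (The volumes in this table contain a single comma each; if any value could contain more than one, str_replace_all() would be the safer choice.)

library(stringr)

# strip the commas, then cast the result to numeric
df.world_ports <- df.world_ports %>%
  mutate(volume = as.numeric(str_replace(volume, ",", "")))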

Calculate shipping-volume rankings, by year

Ok, our numeric shipping volume data is now reformatted into a form that’s ready for analysis.

When we analyze our data though, it would be good to be able to rank the data by shipping volume for every year.

For example, we’ll want to answer the question, “for 2014, which port had the highest volume? the second highest? third, fourth, …”

We’ll not only want to answer that question for 2014, but we’ll want to know the rankings for each individual year. This will allow us to examine the rankings for any particular year, but also examine changes in rankings year-by-year.

Again, to do this, we need to calculate a rank.

Note that the data set already has a ‘rank’ variable, but this actually contains the ranks for 2014 only. Basically, when we used gather() to reshape our data, the 2014 rankings were repeated for each and every year. This is incorrect, so we need to drop the existing rank variable and then re-calculate it.

To drop the existing rank, we’ll just use select() from dplyr. Syntactically, we’re using the ‘minus’ sign to drop the variable by typing “-rank”.

After that, we’re calculating a new rank variable by using a chain of dplyr verbs.

You’ll notice that we’re taking the data frame df.world_ports, then grouping by year (because we want to establish rankings within years). Then we’re using mutate() to modify the data frame and create a new variable called rank.

Notice that inside of mutate(), we’re setting the value of rank to min_rank(desc(volume)). Essentially, this code ranks our observations in descending order by volume (i.e., the largest volume receives a rank of 1, the next largest gets rank 2, etc.). But because we used group_by(), it calculates the ranks within each year.

This will give us a ranking from top to bottom of the busiest ports, by year.
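Putting that together, the re-ranking looks like this:

# drop the stale 2014-only rank, then re-rank within each year
df.world_ports <- df.world_ports %>%
  select(-rank) %>%
  group_by(year) %>%
  mutate(rank = min_rank(desc(volume))) %>%
  ungroup()  # ungroup so later operations aren't silently grouped by year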

Create ‘continent’ variable

Now, we’re going to create a new variable called continent.

This is pretty straightforward. It’s just the continent that each port is on (i.e., North America, Europe, Asia, etc.).

We’ll use this variable to analyze our data by continent (this will be a good variable upon which to group, subset, etc).

To do this, we’ll first list the countries by executing levels(df.world_ports$economy). We’ll then use fct_collapse() from the forcats package to code a new factor variable on the basis of those countries.

fct_collapse() allows us to recode a factor variable and ‘collapse’ its values into more general factor groupings.

When we do this, notice that once again our creation of this new variable occurs inside of a call to dplyr::mutate().
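Here’s a sketch of that step. The groupings below are illustrative only; the full mapping has to cover every country returned by levels(df.world_ports$economy).

# list the countries, then collapse them into continents
levels(df.world_ports$economy)

# (groupings shown are illustrative, not the complete mapping)
df.world_ports <- df.world_ports %>%
  mutate(continent = fct_collapse(economy,
    Asia = c("China", "Japan", "South Korea"),
    Europe = c("Germany", "Netherlands", "Spain"),
    `North America` = c("United States", "Canada")
  ))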

Reorder the variables

Ok, it’s the home stretch. We’re almost done wrangling our data.

One final thing we’ll do is re-order the variables.

They’re currently in an unintuitive order, so we’ll reorganize them such that rank and year come first, followed by location information of increasing specificity, until we finally end the data frame (on the far right) with the shipping volume metric.

To do this, we’ll first get all of the variable names using names().

Then we’ll use those names in a call to the dplyr verb select(). The select() verb is typically used to “select” columns (i.e., subset your data down to a specific set of columns that you list). However, in this case, we’re going to “select” all of our columns. We’re just going to select them in a new order.

Notice again that we’re doing a few small checks and some data inspection.

For example, before re-ordering the variables, I used ncol() to count the number of columns. Then after re-ordering them, I checked the number of columns again, just to make sure that they’re all still there. It might seem unnecessary, but these minor checks are a good way to ensure that you’ve executed your data transformations properly. Checking your work can save you a lot of headaches, so do it, even if it seems “too basic.”
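Here’s a sketch of the re-ordering, with those column-count checks included. The exact ordering is my reading of ‘increasing specificity’:

# check the columns before re-ordering
names(df.world_ports)
ncol(df.world_ports)

# 'select' every column, in the new order
df.world_ports <- df.world_ports %>%
  select(rank, year, continent, economy, port, port_label, lat, lon, volume)

# check the column count again to confirm nothing was dropped
ncol(df.world_ports)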

Spot check a few values

Finally, we’re going to spot check a few values to make sure that they are correct.

We’ve performed quite a few transformations on the data, so it’s entirely possible that something went wrong. We need to make sure that the values are the same as the values in the original Wikipedia table.

To do this, we’ll simply subset down to an exact value. Then we’ll compare the value provided by our data to the value on Wikipedia.

To get specific values out of our data frame, we’re going to use filter() and select().

filter() will allow us to filter our data down to an exact row. In this case, we’re going to filter down to a specific port for a specific year.

We’ll then use select() to select a specific column. In this case, we’ll select volume, which will give us the shipping volume for that port/year combo.
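For example (the port/year combination here is illustrative; pick a few and compare them against the Wikipedia table):

# spot check: shipping volume for a single port in a single year
df.world_ports %>%
  filter(port == "Shanghai" & year == "2012") %>%
  select(volume)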

They’re correct. We could check more, but these look good.

You’ll also be able to see, in the next blog post, what the data look like when we visualize them. Aside from using visualizations for analysis, we’ll also be able to examine them to see if anything looks incorrect.

Want to see part 2?

In the next blog post of this data analysis series, we’re going to use visualizations to analyze the data and tell a story.

To see part 2, SIGN UP for the email list.

When the next blog post is published, you’ll be notified immediately by email.
