2017-01-24

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

In several recent blog posts, I’ve emphasized the importance of data analysis.

My main point has been that if you want to learn data science, you need to learn data analysis. Data analysis is the foundation of practical data science.

With that statement in mind, my most recent blog post showed you “part one” of an example analysis.

In that post, we kicked off the analysis by getting a dataset (from Wikipedia) and manipulating it into shape. It is a dataset about shipping volume at the world’s busiest ports, from 2004 to 2014.

Now that the dataset is ready, we’re going to move into “part 2” of the analysis. We’ll analyze the data using a combination of tools from the tidyverse.

Essentially, I want to show you how to use tools from ggplot2 and dplyr together to analyze data.

Before we start, keep in mind that what you’re going to see is “intermediate” level ggplot2/dplyr.

That said, even if you’re a beginner, you can look at the code and try to follow the general analysis process.

Moreover, although I occasionally disparage copy-and-paste coding, if you’re a beginner, it will still be instructive to copy and run this code. You’ll still be able to take it apart and see how it works. Copy-and-paste won’t allow you to master ggplot2 and dplyr (and memorize the syntax), but in this case it will help you understand.

On the other hand, if you’re not a beginner, and you’ve learned some tidyverse syntax, this will show you how to put the pieces together and get things done. If you’re serious about learning data science, study this post. There’s a lot here for you …

Ok. Let’s get into it.

Get the data

First, we need to retrieve the data.

We built the dataset using the code in the last blog post. We originally scraped it from Wikipedia, and hammered it into shape using dplyr and tidyr.

If you’ve learned a little dplyr and tidyr, I highly recommend that you review that blog post, and possibly run the code and create the dataset yourself. It will give you an integrated view of how to use data manipulation tools together to shape a dataset prior to analysis.

Having said that, if you don’t want to run the code and build the dataset yourself, you can get it by loading it from a URL:

Create ‘themes’ for plot formatting

Before we actually plot our data, we’ll create a few themes that we can apply to our plots.

If you’re not familiar with them, “themes” in ggplot2 are just bundles of formatting code. In ggplot, we use the theme() function to format specific parts of our plot. We can then “bundle” those pieces of code together to create a reusable theme.

That’s what we’re doing here. We’re taking several lines of formatting code (that we execute with the theme() function) and bundling them together.

Notice that we’re actually creating a few themes here:

theme.porttheme: this is just a general theme that we’ll apply to most of our plots in this analysis.

theme.smallmult: this theme will be applied to “small multiple” charts. Because small multiples can be quite large if you have a lot of panels, they can have special formatting requirements. We’re setting up that formatting here.

theme.widebar: this is for “wide” bar charts. As you’ll see in a moment, we’re going to plot some bar charts that are quite wide, and again, these specific charts will have specific formatting requirements. We’re bundling that specific formatting code together here.

Keep in mind: most of these themes were built up iteratively. I mentioned iterative workflow in the last blog post (and in older blog posts) but I’ll call it out again.

I didn’t sit down and write these themes in one go. I built them up as I created new plots and tinkered with the formatting. In many cases, I tested the formatting code directly in the plot (not in the theme bundle). After I tested it and found that it worked, I added it to the theme.

Just keep that in mind: when you create your own themes, build them up over time as you begin to understand the specific formatting requirements of your plots.
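To make the idea of “bundling” concrete, here is a minimal sketch of what a theme bundle can look like. The specific settings (font sizes, colors, the 90-degree label rotation) are placeholder choices for illustration and are not the values used in this analysis; only the names theme.porttheme and theme.widebar come from the text above.

```r
library(ggplot2)

# Hypothetical, pared-down general-purpose theme bundle.
# Every setting here is an illustrative assumption.
theme.porttheme <- theme(
  plot.title = element_text(size = 20, face = "bold"),
  axis.title = element_text(size = 12, colour = "#555555"),
  axis.text = element_text(size = 9),
  panel.grid.minor = element_blank()
)

# A variation for wide bar charts: start from the general bundle and
# override only what differs (here, rotated x-axis labels).
theme.widebar <- theme.porttheme +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```

Once a bundle exists, you apply it by adding it to a plot with +, the same way you would add a single theme() call.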

Get a high-level overview with a bar chart

First, we’re going to make a bar chart of shipping volume vs port for all of our ports, for 2014.

Ultimately, the rows of the data frame are unique by port and year, so this is a good starting point. It will show us the shipping volumes of all of the ports for our most recent year. After this high level overview, we can drill down by subsetting, plotting trends over time, etc.
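As a rough sketch of this kind of overview chart, the code below filters to 2014 and plots volume by port. The data frame here is a synthetic stand-in for df.world_ports (30 made-up ports with made-up volumes), and the column names port, year, and volume are assumed; the colors and label rotation are also just illustrative choices.

```r
library(dplyr)
library(ggplot2)

# Synthetic stand-in for df.world_ports: made-up ports and volumes.
df.world_ports <- expand.grid(
  port = sprintf("Port %02d", 1:30),
  year = c(2013, 2014),
  stringsAsFactors = FALSE
) %>%
  mutate(volume = rep(30:1, times = 2) + (year - 2013))

# Keep only the most recent year.
df.2014 <- df.world_ports %>%
  filter(year == 2014)

# Bar chart of volume by port, ordered from highest to lowest volume.
p <- ggplot(df.2014, aes(x = reorder(port, desc(volume)), y = volume)) +
  geom_bar(stat = "identity", fill = "darkred") +
  labs(x = "Port", y = "Shipping volume") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```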



First, let’s talk about how we’re using this. The chart shows all of the data for the most recent year, so it’s a good tool for getting a high level overview. It quickly gives us a view of how the data are distributed (highly skewed), and it also helps us identify the ports with the highest volume.

Although it has advantages as a “high level” overview, it also has limitations. First, because we have roughly 50 items on the x-axis, the labels must necessarily be relatively small in order to fit it all in. We also need to rotate the labels 90 degrees so the labels fit. This makes the plot hard to read.

This being the case, our next task will be to make a plot that’s easier to read and easier to interpret.

Flipped bar chart (for better interpretability)

We’re going to create a modified version of the bar chart for better interpretability and readability: the flipped bar chart.

A couple of quick notes about the code:

We’re subsetting using filter, down to the 2014 data for the top 25 ports. Why 25? There were too many bars plotted in our last chart, so I’m going to limit the data to the top 25. This causes us to lose some information, but it will make the plot more readable and it will make it easier for the chart to fit in a small display area. This is one of the sacrifices we make when we use a flipped bar chart: better readability, but we sometimes need to plot fewer bars to make sure they fit.

To flip the bar chart, we’re using coord_flip().
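Those two notes can be sketched as follows. Again, the data is synthetic (made-up ports and volumes), and I am assuming a rank column computed per year from volume; the geom_text() settings for the numeric labels are illustrative guesses rather than the exact formatting used for the chart.

```r
library(dplyr)
library(ggplot2)

# Synthetic stand-in for df.world_ports with an assumed rank column.
df.world_ports <- expand.grid(
  port = sprintf("Port %02d", 1:30),
  year = c(2013, 2014),
  stringsAsFactors = FALSE
) %>%
  mutate(volume = rep(30:1, times = 2) + (year - 2013)) %>%
  group_by(year) %>%
  mutate(rank = min_rank(desc(volume))) %>%
  ungroup()

# Subset to 2014 and the top 25 ports.
df.top25 <- df.world_ports %>%
  filter(year == 2014, rank <= 25)

# Flipped bar chart: port names end up on the vertical axis,
# with the volume printed near the end of each bar.
p <- ggplot(df.top25, aes(x = reorder(port, volume), y = volume)) +
  geom_bar(stat = "identity", fill = "darkred") +
  geom_text(aes(label = volume), hjust = 1.1, color = "white") +
  coord_flip() +
  labs(x = "Port", y = "Shipping volume")
```

Note that reorder(port, volume) sorts ascending here; after coord_flip() that puts the largest bar at the top.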



This chart is quite a bit easier to read.

We can read the port names (which are now on the y-axis) without turning our heads sideways. We’ve also added the shipping volume amounts at the end of the bars, which gives us more direct access to the numbers themselves. Adding those numeric labels over the bars wasn’t really possible in the previous bar chart, because the numbers would have been much too small, and they would have been flipped 90 degrees. That would have made the numeric labels effectively unreadable in the last chart.

Also notice the title. We’re using the title to call out important features of the data. This is how and where we begin to “tell a story” with our data. The titles are best used to tell a story. We can then use subtitles to give more context.

Additionally, I’m constructing these charts with the assumption that they will be seen together, in this particular sequence. (If we were constructing a report or analysis, we would order them in a particular sequence). Assuming this sequencing allows us, in many cases, to assume that the reader will understand the context of the charts. In turn, this allows us to omit some contextual details from the titles and subtitles.

Drill down: highlight ports in China

After looking at the first two charts, it appears that many of the busiest ports were in China.

Many readers may say “yeah, of course the busiest ports are in China,” based on the relatively common knowledge that China has a strong manufacturing base. That still doesn’t excuse us from looking at the data to confirm or dis-confirm what we think we know.

Moreover, by explicitly looking at the ports in China, we can make our knowledge (and intuitions) about Chinese shipping volume more precise.



First, let’s talk about the story that we can tell. At a quick glance, it’s clear that a large percent of the busiest ports are in China, because we’ve highlighted them in red. At a glance, this is the most obvious feature of the chart, so we’ve called it out with the title. (We quickly calculated “over 20%” based on the fact that there are ~50 ports in our chart of “busiest ports” for any given year).

Next, let’s talk about how we created the effect.

At the very top of the dplyr pipeline, we’re piping df.world_ports into mutate(). Inside of mutate(), we’re using an ifelse() call to create a “flag.” Essentially, we’re creating a new (temporary) variable that we can use as we pipe this result into ggplot.

We then pipe that result into ggplot(), and there’s a piece of code inside of geom_bar() that allows us to highlight the Chinese ports: aes(fill = china_flag). Essentially, we’re using the “flag” variable that we created in mutate() to specify how we want to color our bars. If the port is a Chinese port, it will get one color, and if it’s not a Chinese port, it will get another.

Where do we specify the colors? We specify which colors to use for the “China” bars and which to use for the “Not China” bars in the following line of code: scale_fill_manual(values = c("dark red", "#999999")).
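Putting those pieces together, a minimal sketch of the flag-and-fill pattern might look like this. The four-row data frame is a toy stand-in with made-up volumes, and the economy column (used to decide which ports are Chinese) is an assumption on my part; china_flag, aes(fill = china_flag), and the scale_fill_manual() values come from the description above.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for df.world_ports (made-up volumes; `economy` is assumed).
df.world_ports <- data.frame(
  port = c("Shanghai", "Singapore", "Shenzhen", "Rotterdam"),
  economy = c("China", "Singapore", "China", "Netherlands"),
  year = 2014,
  volume = c(32, 29, 20, 11),
  stringsAsFactors = FALSE
)

# Step 1: create the temporary flag variable with mutate() + ifelse().
df.flagged <- df.world_ports %>%
  mutate(china_flag = ifelse(economy == "China", "China", "Not China"))

# Step 2: map the flag to fill, then choose the two colors manually.
# The values are matched to the flag levels in alphabetical order:
# "China" gets dark red, "Not China" gets grey.
p <- df.flagged %>%
  filter(year == 2014) %>%
  ggplot(aes(x = reorder(port, desc(volume)), y = volume)) +
  geom_bar(stat = "identity", aes(fill = china_flag)) +
  scale_fill_manual(values = c("dark red", "#999999"))
```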

Drill down: highlight ports in Asia

Next, we’ll create a very similar bar chart that highlights the ports in Asia, as opposed to just China.

The reason for this is that, based on the previous bar charts, it appears that a significant number of the busiest ports are not only in China, but in Asia overall. We want to get a sense of the magnitude of this feature, so we’ll create a chart that highlights all of the Asian ports.

Technically, we’ll do this in a way that’s almost identical to the previous chart: we’ll create an indicator variable called asia_flag and ultimately pipe the result into ggplot(). Then we’ll map that variable to the fill aesthetic and make a subsequent call to scale_fill_manual() to complete the highlighting effect.
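A short sketch of the same pattern for Asia is below. It assumes the data has a continent column, which is my guess at how the Asian ports would be identified; the rows and volumes are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for df.world_ports (made-up volumes; `continent` is assumed).
df.world_ports <- data.frame(
  port = c("Shanghai", "Singapore", "Rotterdam", "Los Angeles"),
  continent = c("Asia", "Asia", "Europe", "North America"),
  year = 2014,
  volume = c(32, 29, 11, 8),
  stringsAsFactors = FALSE
)

# Same flag technique as before, keyed on continent instead of country.
df.asia <- df.world_ports %>%
  mutate(asia_flag = ifelse(continent == "Asia", "Asia", "Other"))

p <- ggplot(df.asia, aes(x = reorder(port, desc(volume)), y = volume)) +
  geom_bar(stat = "identity", aes(fill = asia_flag)) +
  scale_fill_manual(values = c("Asia" = "darkred", "Other" = "#999999"))
```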

Small multiple chart: plot change in shipping volume over time, by port

Ok. We’ve just examined the data for the most recent year in the data (2014). We did this by creating a few bar charts, which gave us some insight into some hot-spots and important features of the data.

But the data aren’t restricted to 2014. We have data from 2004 to 2014. That being the case, we’ll want to look at some trends.

Because our data is broken out by port (i.e., a city/location), we’ll want to see the trends for all of our ports, if we can. This is a perfect use case for a small multiple chart (AKA, a trellis chart, a panel chart).

Essentially, we’ll create a mini line chart for all of our ports. Each line chart will show the trend in shipping volume over time.

To do this, we’re going to use geom_line() to indicate that we’re creating line charts, but then we’ll use facet_wrap() to ultimately break everything out into the small panels. If you’re not familiar with the small multiple design, I highly recommend you check out this blog post on faceting. It will give you some details about how it works in ggplot2.
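Here is a minimal sketch of the geom_line() plus facet_wrap() combination. The four ports and their linear trends are synthetic, made up purely so the code runs on its own; with the real data there would be one panel per port.

```r
library(dplyr)
library(ggplot2)

# Synthetic stand-in for df.world_ports: four ports, 2004 to 2014,
# with made-up linear trends.
df.world_ports <- expand.grid(
  port = c("Port A", "Port B", "Port C", "Port D"),
  year = 2004:2014,
  stringsAsFactors = FALSE
) %>%
  mutate(volume = 10 + match(port, unique(port)) * (year - 2004))

# One mini line chart per port: geom_line() draws the trend,
# facet_wrap() splits the plot into one panel per value of `port`.
p <- ggplot(df.world_ports, aes(x = year, y = volume, group = port)) +
  geom_line(color = "darkred") +
  facet_wrap(~ port) +
  labs(x = "Year", y = "Shipping volume")
```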

I will admit, this chart has a lot of information, and at a quick glance it might overwhelm some viewers.

It’s best used as a high level overview for finding additional details to “drill into”. Given the information density, it requires and rewards a bit of careful study.

Notice as well some of the design features: the axis text is necessarily small. The panel labels (the names of the ports at the top of each panel) are also small. This makes the chart a little hard to read, and that’s a limitation. The tradeoff we make with the small multiple is that we get a large density of information, while sacrificing some readability. Again, the small multiple design rewards and requires careful study to overcome the readability challenges.

As a side note, I think that small multiples are one of the most powerful tools in an analyst’s toolbox, and they are wildly underused. They are extremely effective at creating high-level overviews that help you discover where to “drill in.”

Even though it’s rarely used, I can’t recommend it enough.

In fact, it’s so powerful that I strongly recommend that every beginner learn how to use facet_wrap().

Plot Shanghai volume change over time (line chart)

After examining the preceding bar charts, it has become obvious that Shanghai and Singapore had very high shipping volumes in 2014. And after looking at the trends in the small multiple that we just created, it’s obvious that Shanghai and Singapore have seen substantial increases in shipping volume since 2004.

Because we’ve noticed these features in our overview plots, we now want to briefly focus in on Shanghai and Singapore.

Keep in mind that we only know to focus on Shanghai, Singapore, and a few others because we started with high level charts like the bar charts and small multiples, which allowed us to identify high performers.

The plot that we’ll use to focus in on these individual ports is the basic line chart. In the small multiple, it was clear that there was an upward trend from 2004 to 2014, but the panels were simply too small to get a clear picture. So, we’ll use a line chart to “zoom in” and highlight the information we want to see.

In many ways, this is a very simple application of a line chart. However, we’ll make it a little more visually interesting by plotting the Shanghai port volume trend (in red) but also plotting all of the other lines in a faint, transparent grey.

To obtain this effect, we’ll again use mutate() to add an indicator variable. That indicator variable is then used in some of our ggplot code.

Specifically, we’re going to map the indicator variable to both the color aesthetic and the alpha aesthetic.

Mapping to the color aesthetic will allow us to highlight Shanghai in a specific color. Mapping to the alpha aesthetic will allow us to fade the other lines into the background. This will reduce the complexity of the chart (there would be a lot of lines otherwise) while still giving faint context by having the “ghosted” lines in the background.

Note that syntactically, what ultimately completes the effect is the call to scale_color_manual() to specify the colors and the call to scale_alpha_manual() to specify the transparency of the different lines.
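The highlighting effect described above can be sketched like this. The three ports and their trends are made up, and port_highlight is a hypothetical name for the indicator variable; the exact colors and alpha values are also illustrative assumptions.

```r
library(dplyr)
library(ggplot2)

# Made-up linear trends for three ports (illustration only).
base   <- c(Shanghai = 15, Singapore = 21, Rotterdam = 8)
growth <- c(Shanghai = 2, Singapore = 1.2, Rotterdam = 0.4)

df.world_ports <- expand.grid(
  port = names(base),
  year = 2004:2014,
  stringsAsFactors = FALSE
) %>%
  mutate(volume = unname(base[port] + growth[port] * (year - 2004)))

# Indicator variable: "Shanghai" for the line we want to highlight,
# "Other" for everything else. (`port_highlight` is a hypothetical name.)
df.highlight <- df.world_ports %>%
  mutate(port_highlight = ifelse(port == "Shanghai", "Shanghai", "Other"))

# Map the indicator to both color and alpha, then set each manually.
p <- ggplot(df.highlight, aes(x = year, y = volume, group = port)) +
  geom_line(aes(color = port_highlight, alpha = port_highlight)) +
  scale_color_manual(values = c("Shanghai" = "red", "Other" = "#999999")) +
  scale_alpha_manual(values = c("Shanghai" = 1, "Other" = 0.3)) +
  theme(legend.position = "none")
```

The group = port mapping matters here: without it, all of the “Other” ports would be joined into a single zig-zag line.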

I won’t discuss the findings much here, but I will note that again, we’re using the title to “tell a story” about what we’ve found.

Plot Singapore volume change over time (line chart)

Because we’ve identified Singapore as another “hot spot” in our data, we’ll plot the change in volume for Singapore as well.

This chart is structurally similar to the one we just created for Shanghai, except I’m not plotting any of the other lines for context. Part of the reason is to show you a version that’s a little simpler to create.

Notice that this is essentially just a simple line chart using ggplot() and geom_line().

Tactically, we’re “zooming in” on Singapore by subsetting the data using filter(port == "Singapore").
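In sketch form, assuming the same port, year, and volume columns and using made-up numbers in place of the real data:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for df.world_ports (made-up volumes for two ports).
df.world_ports <- data.frame(
  port = rep(c("Singapore", "Rotterdam"), each = 3),
  year = rep(c(2012, 2013, 2014), times = 2),
  volume = c(27, 28, 29, 10, 10.5, 11),
  stringsAsFactors = FALSE
)

# "Zoom in" by subsetting to a single port, then draw a plain line chart.
df.singapore <- df.world_ports %>%
  filter(port == "Singapore")

p <- ggplot(df.singapore, aes(x = year, y = volume)) +
  geom_line(color = "darkred") +
  labs(x = "Year", y = "Shipping volume")
```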

One thing that I didn’t call out in these last two plots for Shanghai and Singapore is the drop in 2009.

These drops make sense in the context of the 2008-2009 financial crisis, and they may be something that we’d want to discuss more (or investigate more) depending on our goals.

That said, I’m going to leave them here without more comment or analysis. I’m pointing this out so you understand that sometimes, you need to make decisions about the story you communicate with your data. When you only have a brief headline (and your audience has a brief attention span) you need to make choices about what you communicate, and what you leave out.

Plot rank changes of all ports with small multiple

Next, we’ll plot all of the ranking changes for all of the ports.

You might recall that when we created this data set, we ranked all of the ports by shipping volume for every single year (2004 – 2014). What this means is that for any given year, we can identify which port had the highest shipping volume, which had the second highest, and so on.

To plot all of the ranking changes for all of the ports, we’ll use another small multiple chart.

We’re going to plot the rank of each port over time. This is essentially a line chart of rank vs year, faceted on the port variable.
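A minimal sketch of that chart follows. The data is synthetic, and the rank column is recomputed here with group_by() and min_rank() as an assumption about how it was built in part one. Reversing the y-axis so that rank 1 sits at the top is also my own choice, not something stated above.

```r
library(dplyr)
library(ggplot2)

# Made-up linear trends for four synthetic ports (illustration only).
base   <- c("Port A" = 40, "Port B" = 30, "Port C" = 20, "Port D" = 10)
growth <- c("Port A" = 0, "Port B" = 1.5, "Port C" = 3, "Port D" = 5.2)

df.ranked <- expand.grid(
  port = names(base),
  year = 2004:2014,
  stringsAsFactors = FALSE
) %>%
  mutate(volume = unname(base[port] + growth[port] * (year - 2004))) %>%
  # Rank ports within each year: 1 = highest volume that year.
  group_by(year) %>%
  mutate(rank = min_rank(desc(volume))) %>%
  ungroup()

# Line chart of rank vs year, one panel per port.
p <- ggplot(df.ranked, aes(x = year, y = rank, group = port)) +
  geom_line(color = "darkred") +
  scale_y_reverse() +
  facet_wrap(~ port)
```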

This might be the most interesting chart so far. There are a lot of little stories here that we can identify, and new questions to ask.

You can see the rise of several Chinese ports (which we’ll address next) as well as the relative decline of a few ports in America (LA, Long Beach) and Japan (Yokohama).

The stories about China, America, and Japan only scratch the surface. There are other interesting rises and dips. There are many other ports we could investigate. However, in the interest of time, it’s difficult to call them all out. As a data scientist and analyst, you often need to choose what to focus on. That means excluding some things or setting some things aside for later (when you have more time/resources).

Plot rank changes for top Chinese ports with a bump chart

In the last high-level chart, we saw that many Chinese ports rose in rank from 2004 to 2014.

But again, the small multiple design that we’ve just used has limitations in that it’s difficult to read. The panels are so small, that it’s difficult to really see specific details. Moreover, because the small multiple broke each port out into separate panels, it’s difficult to see rank changes of ports against one another.

So, now that we’ve seen the data at a high level, we’ll once again focus in. To do this, we’re going to use a bump chart.

A bump chart is essentially a line chart, where rank is on the y-axis, and time is on the x-axis. Essentially, the bump chart allows us to track rank changes over time.

To limit the complexity of the chart (to stop it from getting messy, with too many lines), we’re going to limit this to the top 15 rankings. To do this, notice that I’ve created a parameter, param.rank_n = 15, that we subsequently use in the code. This essentially allows us to specify how many ranks we want to plot on the y-axis. We’ve set it to “15” in order to plot the “top 15”.

We’re also going to highlight the Chinese ports. This is for 2 reasons:

it was clear in the last chart that Chinese ports rose in rank from 2004 to 2014

highlighting a few lines will reduce the visual complexity

To accomplish this, we’re doing a little bit of data manipulation: we’re again creating a flag for the Chinese ports, but we’re also creating a variable called china_labels. china_labels will enable us to individually color each line for the different Chinese ports (we do this in conjunction with scale_color_manual()).

We’re also going to modify the transparency of the lines. We’ll set the Chinese lines to almost fully opaque, and set the non-Chinese lines to be highly transparent. This causes the non-Chinese lines to fade into the background and causes the Chinese ranking-lines to pull into the foreground. Modifying the transparency of the lines is another technique for highlighting the Chinese rankings. This helps us visually tell a story that’s specifically about China. Moreover, making the “other” lines more transparent, helps to reduce the visual complexity of the chart.

Ultimately, bump charts have a tendency to become too visually complicated, so we’re mitigating that complexity by controlling transparency and color.
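The core of the bump chart can be sketched as below. This is a simplified version under several assumptions: the 20 ports are synthetic, three of them are arbitrarily labeled as Chinese via an assumed economy column, the volumes are made up, and the line labels mentioned later are omitted. The names param.rank_n, china_flag, and china_labels come from the text; the colors and alpha values are illustrative.

```r
library(dplyr)
library(ggplot2)

param.rank_n <- 15  # how many ranks to show on the y-axis

# Synthetic data: 20 ports, three arbitrarily marked as "China",
# with made-up volumes in which the "China" ports grow faster.
df.ranked <- expand.grid(id = 1:20, year = 2004:2014) %>%
  mutate(
    port = sprintf("Port %02d", id),
    economy = ifelse(id %in% c(3, 8, 12), "China", "Elsewhere"),
    volume = (100 - 4 * id) +
      ifelse(economy == "China", 5.5, 1) * (year - 2004)
  ) %>%
  group_by(year) %>%
  mutate(rank = min_rank(desc(volume))) %>%
  ungroup() %>%
  # Flag for transparency; per-port label for individual line colors.
  mutate(
    china_flag = ifelse(economy == "China", "China", "Not China"),
    china_labels = ifelse(china_flag == "China", port, "other")
  )

df.bump <- df.ranked %>%
  filter(rank <= param.rank_n)

p <- ggplot(df.bump, aes(x = year, y = rank, group = port)) +
  geom_line(aes(color = china_labels, alpha = china_flag)) +
  geom_point(aes(color = china_labels, alpha = china_flag)) +
  scale_y_reverse(breaks = 1:param.rank_n) +
  scale_alpha_manual(values = c("China" = 0.9, "Not China" = 0.2)) +
  scale_color_manual(values = c(
    "Port 03" = "#e41a1c", "Port 08" = "#377eb8",
    "Port 12" = "#4daf4a", "other" = "#999999"
  )) +
  theme(legend.position = "none")
```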

There’s a lot here, so I’ll leave it as an exercise for the reader to interpret this and call out interesting findings.

As a side note, this chart is somewhat difficult to create. I’ve already noted that this whole post is at an “intermediate” level, but the bump chart we just created borders on advanced. The real challenge in creating this bump chart properly is that it requires you to properly set several “fine details.” You need to know how to add the labels in the right location. You need to know how to control the color and transparency (and get them “just right”). The techniques to accomplish these aren’t terribly complicated themselves, but you really need to know what you’re doing to make it all work just right.

Map the data

Alright. Enough with the boring bars and lines.

Let’s do something sexy and arguably less useful.

Let’s map some stuff.

Create basic dot distribution map

Because the data has a geospatial component (these are port locations, with associated latitude/longitude) we’re going to map this data.

First, we’ll just get the map itself using map_data(). This will give us a world map that has outlines of the world’s countries.

Next, we’ll do a simple “trial” by just plotting the port locations on the map.

We’ll plot the world map as one layer by using geom_polygon().

On top of the map, we’ll plot the ports as points by using geom_point().

This is the “layering” technique that I’ve discussed recently. We’re building this plot in layers: first the “polygons” (i.e., the country map), and second the points.
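Here is a minimal sketch of that two-layer map. It assumes the maps package is installed (map_data() needs it), and it uses three ports with approximate coordinates that I have typed in by hand; the lon and lat column names are assumptions about the real data.

```r
library(ggplot2)

# Toy port locations (approximate coordinates, for illustration only).
df.ports <- data.frame(
  port = c("Shanghai", "Singapore", "Rotterdam"),
  lon = c(121.5, 103.8, 4.5),
  lat = c(31.2, 1.3, 51.9),
  stringsAsFactors = FALSE
)

# World country outlines as a data frame (requires the `maps` package).
map.world <- map_data("world")

# Layer 1: the country polygons. Layer 2: the ports as points.
p <- ggplot() +
  geom_polygon(
    data = map.world,
    aes(x = long, y = lat, group = group),
    fill = "#CCCCCC"
  ) +
  geom_point(data = df.ports, aes(x = lon, y = lat), color = "red")
```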

Create bubble distribution map

Finally, let’s make a polished version of the map, by transforming this into a “bubble distribution map.”

First, we’ll define a theme.

Now that we have our theme set, we’ll finally map the data.

Essentially, we’re going to use our previous dot distribution map as a foundation, and add another detail: we’re going to relate the size of the points to the “volume” variable in our data.
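A sketch of both steps (a map theme, then the bubble layer) is below. The theme name theme.maptheme, its settings, the toy coordinates and volumes, and the size range are all assumptions for illustration; the one piece taken from the text is mapping point size to the volume variable. As before, map_data() requires the maps package.

```r
library(ggplot2)

# Hypothetical map theme: strip axes and gridlines, which carry
# little meaning on a map. All settings are illustrative.
theme.maptheme <- theme(
  axis.title = element_blank(),
  axis.text = element_blank(),
  axis.ticks = element_blank(),
  panel.grid = element_blank(),
  panel.background = element_rect(fill = "#EEEEEE")
)

# Toy ports: approximate coordinates, made-up volumes.
df.ports <- data.frame(
  port = c("Shanghai", "Singapore", "Rotterdam"),
  lon = c(121.5, 103.8, 4.5),
  lat = c(31.2, 1.3, 51.9),
  volume = c(32, 29, 11),
  stringsAsFactors = FALSE
)

map.world <- map_data("world")

# Same layers as the dot map, but point size now encodes volume.
p <- ggplot() +
  geom_polygon(
    data = map.world,
    aes(x = long, y = lat, group = group),
    fill = "#CCCCCC"
  ) +
  geom_point(
    data = df.ports,
    aes(x = lon, y = lat, size = volume),
    color = "red", alpha = 0.4
  ) +
  scale_size_continuous(range = c(1, 12)) +
  theme.maptheme
```

The partial transparency (alpha) is there because bubbles for nearby ports tend to overlap.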

I want to point out a few things.

First, in spite of my joke about maps being “arguably less useful” we can get some information from this: at a glance, this map gives us a sense of the density of shipping volume in Asia vs everywhere else. And because it’s a map, it conforms more strongly to our spatial intuitions. That is, when we say “shipping volume is higher in Asia,” that’s something we can directly perceive on a map, whereas it’s something abstract when we mention it alongside a bar chart.

On the other hand, a map like this is not great for making precise distinctions. I’ll save my discussion of this for another blog post, but the human visual system isn’t very good at making precise distinctions between areas of different sizes. For example, looking at this map, how much bigger is Shanghai shipping volume than Hong Kong shipping volume? Looking at this map, it’s hard to tell. Part of that is due to the overlap of the points. But the other part is that we’re not good at evaluating differences in area. If we want to make precise statements about differences, the bar chart is a much better tool.

Closing notes: this is simpler than it looks

There’s a lot here. This is a massive post, and quite possibly the most information dense post I’ve created yet.

If you really look at the tools and techniques here, you’ll see quite a bit about how to use ggplot2, dplyr, and the tidyverse together to analyze data.

Having said that, there are lots of details here that might make this look very complicated to a beginner.

If you look at this and feel overwhelmed, don’t worry.

In the near future, I’m going to write a blog post that strips this down to bare bones. In spite of some of the surface level complexity, this is simpler than you might think. I’d estimate that only a couple dozen major techniques do most of the heavy lifting.

What that means is that if you’re a beginner and want to learn to do this, you can get very far by learning only a few highly-used techniques. It’s the 80/20 rule in action.

Sign up to learn about analysis in R

If you want to get a data science job, you need to be great at analyzing data.

In R, that means that you need to be highly skilled in ggplot2, dplyr, and other core parts of the tidyverse.

Sign up now and we’ll show you how to master these tools.

The post How to do an analysis in R (part 2, visualization and analysis) appeared first on SHARP SIGHT LABS.
