2015-08-19

Introduction

An important step in data analysis is data exploration and representation. We have already seen some concepts in Exploratory Data Analysis and how to use them with both, Python and R. In this tutorial we will see how by combining a technique called Principal Component Analysis (PCA) together with Cluster Analysis we can represent in a two-dimensional space data defined in a higher dimensional one while, at the same time, being able to group this data in similar groups or clusters and find hidden relationships in our data.

More concretely, PCA reduces data dimensionality by finding principal components. These are the directions of maximum variation in a dataset. By reducing a dataset original features or variables to a reduced set of new ones based on the principal components, we end up with the minimum number of variables that keep the maximum amount of variation or information about how the data is distributed. If we end up with just two of these new variables, we will be able to represent each sample in our data in a two-dimensional chart (e.g. a scatterplot).

As an unsupervised data analysis technique, clustering organises data samples by proximity based on its variables. By doing so we will be able to understand how each data point relates to each other and discover groups of similar ones. Once we have each of this groups or clusters, we will be able to define a centroid for them, an ideal data sample that minimises the sum of the distances to each of the data points in a cluster. By analysing these centroids’ variables we will be able to define each cluster in terms of its characteristics.

But enough theory for today. Let’s put these ideas in practice by using Python and R so we better understand them in order to apply them in the future.

All the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!

Preparing our TB data

We will continue using the same datasets we already loaded in the part introducing data frames. The Gapminder website presents itself as a fact-based worldview. It is a comprehensive resource for data regarding different countries and territories indicators. Its Data section contains a list of datasets that can be accessed as Google Spreadsheet pages (add &output=csv to download as CSV). Each indicator dataset is tagged with a Data provider, a Category, and a Subcategory.

For this tutorial, we will use a dataset related to prevalence of Infectious Tuberculosis:

TB estimated prevalence (existing cases) per 100K

We invite the reader to repeat the process with the new cases and deaths datasets and share the results.

For the sake of making the tutorial self-contained, we will repeat here the code that gets and prepared the datasets bot, in Python and R. This tutorial is about exploring countries. Therefore, we will work with datasets where each sample is a country and each variable is a year.

R

In R, you use read.csv to read CSV files into data.frame variables. Although the R function read.csv can work with URLs, https is a problem for R in many cases, so you need to use a package like RCurl to get around it.

Python

So first, we need to download Google Spreadsheet data as CSV.

Now that we have it locally, we need to read the CSV file as a data frame.

We have specified index_col to be 0 since we want the country names to be the
row labels. We also specified the thousands separator to be ‘,’ so Pandas
automatically parses cells as numbers. We can use head() to check the first
few lines.

year

1990

1991

1992

1993

1994

1995

1996

1997

1998

1999

2000

2001

2002

2003

2004

2005

2006

2007

country

Afghanistan

436

429

422

415

407

397

397

387

374

373

346

326

304

308

283

267

251

238

Albania

42

40

41

42

42

43

42

44

43

42

40

34

32

32

29

29

26

22

Algeria

45

44

44

43

43

42

43

44

45

46

48

49

50

51

52

53

55

56

American Samoa

42

14

4

18

17

22

0

25

12

8

8

6

5

6

9

11

9

5

Andorra

39

37

35

33

32

30

28

23

24

22

20

20

21

18

19

18

17

19

Dimensionality reduction with PCA

In this section, we want to be able to represent each country in a two dimensional space. In our dataset, each sample is a country defined by 18 different variables, each one corresponding to TB cases counts per 100K (existing, new, deaths) for a given year from 1990 to 2007. These variables represent not just the total counts or average in the 1990-2007 range but also all the variation in the time series and relationships within countries in a given year. By using PCA we will be able to reduce these 18 variables to just the two of them that best captures that information.

In order to do so, we will first how to perform PCA and plot the first two PCs in both, Python and R. We will close the section by analysing the resulting plot and each of the two PCs.

R

The default R package stats comes with function prcomp() to perform principal component analysis. This means that we don’t need to install anything (although there are other options using external packages). This is perhaps the quickest way to do a PCA, and I recommend you to call ?prcomp in your R console if you’re interested in the details of how to fine tune the PCA process with this function.

The resulting object contains several pieces of information related with principal component analysis. We are interested in the scores, that we have in pca_existing$x. We got 18 different principal components. Remember that the total number of PCs corresponds to the total number of variables in the dataset, although we normally don’t want to use all of them but the subset that corresponds to our purposes.

In our case we will use the first two. How much variation is explained by each one? In R we can use the plot function that comes with the PCA result for that.



Most variation is explained by the first PC. So let’s use the first two PCs to represent all of our countries in a scatterplot.

Now that we have them in a data frame, we can use them with plot.



Let’s set the color associated with the mean value for all the years. We will use functions rgb, ramp, and rescale to create a color palette from yellow (lower values) to blue (higher values).



Now let’s associate colour with total sum.

And finally let’s associate it with the difference between first and last year, as a simple way to measure the change in time.

We have some interesting conclusions already about what PC1 and PC2 code as a representation of the years from 1990 to 2008. We will explain that right after showing how to perform dimensionality reduction using Python.

Python

Python’s sklearn machine learning library comes with a PCA implementation.
This implementation uses the scipy.linalg implementation of the singular value decomposition. It
only works for dense arrays (see numPy dense arrays or
sparse array PCA if you are using sparse arrays) and is not scalable to large dimensional data. For large dimensional data we should consider something such as Spark’s dimensionality reduction features. In our case we just have 18 variables, and that is far from being a large number of features for today’s machine learning libraries and computer capabilities.

When using this implementation of PCA we need to specify in advance the number
of principal components we want to use. Then we can just call the fit() method
with our data frame and check the results.

This gives us an object we can use to transform our data by calling transform.

Or we could have just called fit_transform to perform both steps in one single
call.

In both cases we will end up with a lower dimension representation of our data
frame, as a numPy array. Let’s put it in a new data frame.

PC1

PC2

country

Afghanistan

-732.215864

203.381494

Albania

613.296510

4.715978

Algeria

569.303713

-36.837051

American Samoa

717.082766

5.464696

Andorra

661.802241

11.037736

We can also print the explained variance ratio as follows.

We see that the first PC already explains almost 92% of the variance, while the
second one accounts for another 6% for a total of almost 98% between the two of
them.

Now we are ready to plot the lower dimensionality version of our dataset. We just need to call plot on the data frame, by passing the kind of plot we want (see here for more on plotting data frames) and what columns correspond to each axis. We also add an annotation loop that tags every point with its country name.

Let’s now create a bubble chart, by setting the point size to a value proportional to the mean value for all the years in that particular country. First we need to add a new column containing the re-scaled mean per country across all the years.

PC1

PC2

country_mean

country_mean_scaled

country

Afghanistan

-732.215864

203.381494

353.333333

0.329731

Albania

613.296510

4.715978

36.944444

0.032420

Algeria

569.303713

-36.837051

47.388889

0.042234

American Samoa

717.082766

5.464696

12.277778

0.009240

Andorra

661.802241

11.037736

25.277778

0.021457

Now we are ready to plot using this variable size (we will omit the country names this time since we are not so interested in them).

Let’s do the same with the sum instead of the mean.

And finally let’s associate the size with the change between 1990 and 2007. Note that in the scaled version, those values close to zero will make reference to those with negative values in the original non-scaled version, since we are scaling to a [0,1] range.

country_change

country_change_scaled

country

Afghanistan

-198

0.850840

Albania

-20

1.224790

Algeria

11

1.289916

American Samoa

-37

1.189076

Andorra

-20

1.224790

PCA Results

From the plots we have done in Python and R, we can confirm that the most variation happens along the y axis, which we have assigned to PC1. We saw that the first PC already explains almost 92% of the variance, while the second one accounts for another 6% for a total of almost 98% between the two of them. At the very top of our charts we saw an important concentration of countries, most of them developed. While we descend that axis, the number of countries is more sparse, and they belong to less developed regions of the world.

When colouring/sizing points using two absolute magnitudes such as average and total number of cases, we can see that the directions also correspond to a variation in these magnitudes.

Moreover, when using color/size to code the difference in the number of cases over time (2007 minus 1990), the color gradient mostly changed along the direction of the second principal component, with more positive values (i.e. increase in the number of cases) coloured in blue or with bigger size . That is, while the first PC captures most of the variation within our dataset and this variation is based on the total cases in the 1990-2007 range, the second PC is largely affected by the change over time.

In the next section we will try to discover other relationships between countries.

Exploring data structure with k-means clustering

In this section we will use k-means clustering to group countries based on how similar their situation has been year-by-year. That is, we will cluster the data based in the 18 variables that we have. Then we will use the cluster assignment to colour the previous 2D chart, in order to discover hidden relationship within our data and better understand the world situation regarding the tuberculosis disease.

When using k-means, we need to determine the right number of groups for our case. This can be done more or less accurately by iterating through different values for the number of groups and compare an amount called the within-cluster sum of square distances for each iteration. This is the squared sum of distances to the cluster center for each cluster member. Of course this distance is minimal when the number of clusters gets equal to the number of samples, but we don’t want to get there. We normally stop when the improvement in this value starts decreasing at a lower rate.

However, we will use a more intuitive approach based on our understanding of the world situation and the nature of the results that we want to achieve. Sometimes this is the way to go in data analysis, specially when doing exploration tasks. To use the knowledge that we have about the nature of our data is always a good thing to do.

R

Obtaining clusters in R is as simple as calling to kmeans. The function has several parameters, but we will just use all the defaults and start trying with different values of k.

Let’s start with k=3 asuming that at least, the are countries in a really bad situation, countries in a good situation, and some of them in between.

The result contains a list with components:

cluster: A vector of integers indicating the cluster to which each point is allocated.

centers: A matrix of cluster centres.

withinss: The within-cluster sum of square distances for each cluster.

size: The number of points in each cluster.

Let’s colour our previous scatter plot based on what cluster each country belongs to.

Most clusters are based on the first PC. That means that clusters are just defined in terms of the total number of cases per 100K and not how the data evolved on time (PC2). So let’s try with k=4 and see if some of these cluster are refined in the direction of the second PC.

There is more refinement, but again in the direction of the first PC. Let’s try then with k=5.

There we have it. Right in the middle we have a cluster that has been split in two different ones in the direction of the second PC. What if we try with k=6?

We get some diagonal split in the second top cluster. That surely contains some interesting information, but let’s revert to our k=5 case and later on we will see how to use a different refinement process with clusters are too tight like we have at the top of the plot.

Python

Again we will use sklearn, in this case its k-means clustering implementation, in order to perform our clustering on the TB data. Since we already decided on a number of clusters of 5, we will use it here straightaway.

Now we need to store the cluster assignments together with each country in our
data frame. The cluster labels are returned in clusters.labels_.

And now we are ready to plot, using the cluster column as color.

The result is pretty much as the one obtained with R, with the color differences
and without the country names that we decided not to include here so we can better see the colours. In the next section we will analyse each cluster in detail.

Cluster interpretation

Most of the work in this section is about data frame indexing and plotting. There isn’t anything sophisticated about the code itself, so we will pick up one of our languages and perform the whole thing (we will use R this time).

In order to analyse each cluster, let’s add a column in our data frame containing the cluster ID. We will use that for subsetting.

The last line shows how many countries do we have in each cluster.

Centroids comparison chart

Let’s start by creating a line chart that compares the time series for each cluster centroid. This chart will helps us better understand our cluster results.

Cluster 1

Cluster 1 contains just 16 countries. These are:

The centroid that represents them is:

These are by all means the countries with the most tuberculosis cases every year. We can see in the chart that this is the top line, although the number of cases descends progressively.

Cluster 2

Cluster 2 contains 30 countries. These are:

The centroid that represents them is:

It is a relatively large cluster. Still countries with lots of cases, but definitively less than the first cluster. We see countries such as India or China here, the larger countries on earth (from a previous tutorial we know that China itself has reduced its cases by 85%) and american countries such as Peru or Bolivia. In fact, this is the cluster with the fastest decrease in the number of existing cases as we see in the line chart.

Cluster 3

This is an important one. Cluster 3 contains just 20 countries. These are:

The centroid that represents them is:

This is the only cluster where the number of cases has increased over the years, and is about to overtake the first position by 2007. Each of these countries are probably in the middle of an humanitarian crisis and probably beeing affected by other infectious diseases such as HIV. We can confirm here that PC2 is coding mostly that, the percentage of variation over time of the number of exiting cases.

Cluster 4

The fourth cluster contains 51 countries.

Represented by its centroid.

This cluster is pretty close to the last and larger one. It contains many american countries, some european countries, etc. Some of them are large and rich, such as Russia or Brazil. Structurally the differece with the countries in Cluster 5 may reside in a larger number of cases per 100K. They also seem to be decreasing the number of cases slightly faster than Cluster 5. These two reasons made k-means cluster them in a different group.

Cluster 5

The last and bigger cluster contains 90 countries.

Represented by its centroid.

This cluster is too large and heterogeneous and probably needs further refinement. However, it is a good grouping when compared to other distant clusters. In any case it contains those countries with less number of existing cases in our set.

A second level of clustering

So let’s do just that quickly. Let’s re-cluster the 90 countries in our Cluster 5 in order to firther refine them. As the number of clusters let’s use 2. We are just interested in seeing if there are actually two different clusters withing Cluster 5. The reader can of course try to go further and use more than 2 centers.

Now we can plot them in order to see if there are actual differences.

There are actually different tendencies in our data. We can see that there is a group of countries in our original Cluster 5 that is decreasing the number cases at a faster rate, trying to catch up with those countries with a lower number of existing TB cases per 100K.

While the countries with less number of cases and also slower decreasing rate is.

However, we won’t likely obtain this clusters by just increasing in 1 the number of centers in our first clustering process with the original dataset. As we said, Cluster 5 seemed like a very cohesive group when compared with more distant countries. This two step clustering process is a useful technique that we can use with any dataset we want to explore.

Conclusions

I hope you enjoyed this data exploration as much as I did. We have seen how to use Python and R to perform Principal Componen Analysis and Clustering on our Tuberculosis data. Although we don’t advocate the use of one platform over the other, sometimes it’s easier to perform some tasks in one of them. Again we see that the programming interface in Python is more consistent (we will have this more clear after more tutorials comparing Python and R), clear, and modular. However R has been created around performing statistics and analytics tasks, and more often than not we will have a way of doing something in one or two function calls.

Regarding PCA and k-means clustering, the first technique allowed us to plot the distribution of all the countries in a two dimensional space based on their evolution of number of cases in a range of 18 years. by doing so we saw how the total number of cases mostly defines the principal component (i.e. the direction of largest variation) while the percentage of change in time affects the second component.

Then we used k-means clustering to group countries by how similar their number of cases in each year are. By doing so we uncovered some interesting relationships and, most importantly, better understood the world situation regarding this important disease. We saw how most countries improved in the time lapse we considered, but we were also able to discover a group of countries with a high prevalence of the disease that, far from improving their situation, are increasing the number of cases.

Remember that all the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!

Show more