2016-08-21

(This article was first published on Pareto's Playground, and kindly contributed to R-bloggers)

Introduction

When data becomes high-dimensional, the inherent relational structure between the variables can sometimes become unclear or indistinct. One, might want to find clusters for numerous amounts of reasons – me, I want to use it to better understand my childhood. To be more specific, I will be using clustering to highlight different groupings of pokemon. The results of this analysis can then retrospectively be applied to a younger me having to choose which pokemon I catch and keep, or perhaps which I must rather use in battle to gain experience points. The clusters should help me identify groupings of pokemons that assimilate with my style of play, be it catching pokemon who are specialist of their type, strong attackers, survivalist who have good defensive capabilities or pokemon who have the potential to become great as soon as they evolve.

This train of thought (aka hypothesis) will carry through to the analysis when number of cluster and inference needs to be conducted.

About clustering techniques

Within the CRM field, a common practice is to segment client data in order to identify certain clusters of customers, users or products. The first question that I asked myself when I first encountered this technique was: “How usefull can this really be when there is no quantitative measure which dictates the number of clusters”. The answer to this question became quite clear to me when I encountered a dataset of 8000 variables.

Clustering lies within the field of data-reduction, and has the intention to uncover cohesive subgroups of observations within a very large dataset where inference aren’t always clear from face value – i.e an Excel pivot table isn’t going to cut it. Clustering is not only used within marketing, but is applied in biology, behavioural sciences, economics and medical research. The application of cluster analysis in the medical sciences interest me, as they use this technique to help catalog gene-expression patters which were obtained from DNA microarray data. Very cool application of this statistical technique. Without clustering, this task would become almost impossible due to the large amount of information.

Clustering can primarily be divided up in two techniques:

Hierarchical Agglomerative methods

Partitioning clustering

Hierarchical agglomerative clustering works from a bottom up approach where at the beginning, each observation is in its own cluster. From this clusters are combined into larger clusters, 2 at a time, until all the cluster have essentially been merged into one big cluster. With partitioning, one will specify K cluster which are sought after. Then the algorithm will essentially start by picking randomly dividing observations into clusters, assessing similarity, reshuffle, asses again and keep doing this until cohesive clusters are formed.

There is a lot of different techniques, but for the rest of this exploration into the topic, I will be using hierarchical agglomerative clustering (hclust) as my choice of clustering algorithm.

Now, the question is: “How are the observations linked to form these clusters?”. In hclust the most popular techniques are:

Single linkage: Shortest distance between a point in one cluster and a point in the other cluster.

Complete linkage: Longest distance between a point in one cluster and a point in the other cluster.

Average linkage: Average distance between each point in one cluster and each point in the other

cluster (also called UPGMA [unweighted pair group mean averaging])

Centroid: Distance between the centroids (vector of variable means) of the two clusters.

For a single observation, the centroid is the variable’s values.

Ward’s Method: The ANOVA sum of squares between the two clusters added up over all the

variables.

You are probably asking – what distance are you talking about here? We will use the dist() function in R to calculate the euclidean distance:



where q and p are the observations and N is the number of variables. Easy enough right? I will be using ward’s method to cluster my objects as it is the default setting for Hierarchical agglomerative clustering for the HCPC function in library(FactoMineR)

Lets go catch our pokemon

To collect the data on all the first generation pokemon, I employ Hadley Wickam’s rvest package. I find it very intuitive and can handle all of my needs in collecting and extracting the data from a pokemon wiki. I will grab all the Pokemon up until to Gen II, which constitutes 251 individuals. I did find the website structure a bit of a pain as each pokemon had very different looking web pages. But, with some manual hacking, I eventually got the data in a nice format.

The cleaned data looks as follows:

For those of you who know pokemon well, will also know that certain pokemon have only one evolution stage. We remove them in order to not intervere with our clustering at a later stage.

The following pokemon were removed from the dataset: Farfetch’d, Kangaskhan, Pinsir, Tauros, Lapras, Ditto, Aerodactyl, Articuno, Zapdos, Moltres, Mewtwo, Mew, Unown, Girafarig, Dunsparce, Qwilfish, Shuckle, Heracross, Corsola, Delibird, Skarmory, Stantler, Smeargle, Miltank, Raikou, Entei, Suicune, Lugia, Ho-oh, Celebi

The next step was to aggregate all the stage 2 statistics for the stage 1 evolution pokemon. This results in a nice wide dataset of variables we are able to use in our clustering. I am hoping to extract some sense of strenghts in pokemon, not only by their stage 1 statistics, but also perhaps their potential to become awesome assets later. One hopeful example of this would be everyone’s favourite: magikarp.

The data has been constructed to capture characteristics of a specific pokemon at both the first and second level of evolution. It is these variables that we will be using in our clustering. The total column acts as a general indicator of pokemon’s attributes:

HP.lv1

Atk.lv1

Def.lv1

SA.lv1

SD.lv1

Spd.lv1

Ekans

30

60

44

40

54

55

Spinarak

40

60

40

40

40

30

Chikorita

45

49

65

49

65

45

Charmander

39

52

43

60

50

65

Totodile

50

65

64

44

48

43

Seel

65

45

55

45

70

45

Table: Table continues below

Total.lv1

Type.I.lv1

lvl_up.lv1

HP.lv2

Atk.lv2

Ekans

283

Poison

22

60

85

Spinarak

250

Bug

22

70

90

Chikorita

318

Grass

16

60

62

Charmander

309

Fire

16

58

64

Totodile

314

Water

18

65

80

Seel

325

Water

34

90

70

Table: Table continues below

Def.lv2

SA.lv2

SD.lv2

Spd.lv2

Total.lv2

Ekans

69

65

79

80

438

Spinarak

70

60

60

40

390

Chikorita

80

63

80

69

414

Charmander

58

80

65

80

405

Totodile

80

59

63

58

405

Seel

80

70

95

70

475

For those curious to see the pokemon which ended up in our dataset. Here they are:

One of the interesting variables that forms part of the data is the crucial question every trainer asks himself when catching pokemon – which types are the strongest? For each of my ~60 pokemon I use a boxplot to evaluate the relative strenght of a type. The data was normalized per level of evolution to ease plotting and interpretation.


Good old bug pokemons don’t catch a break, with the median total points being the lowest in both stages of evolution. For Normal, Poison and Water types, there seems to be a definite advantage to evolve in order to ‘up’ the overall statistics. An interesting type to evaluate is the Fire type pokemon. In the first stage of evolution this type of pokemon seems to have an overall advantage, but once the pokemon starts evolving, the advantage dissipates.

One of the concerns I had was the class imbalance that might be present in the pokemon type.


The graph clearly points this out, so instead of adding the pokemon types into the analysis, they will be included as supplementary variable.

Going Prof Elm and analysing pokemon

If you don’t know who Prof Elm is, this link should help. To start our exploration into the pokemon dataset, we will conduct a multiple factor analysis. I find the flexibility of this function being able to conduct MCA and PCA in one go very helpful. It also has incredible plotting functions that helps to visually analyse your data.

Here we see that there is a definite inverse relationship between the speed of a Pokemon and its attack statistics. I find the relationship between special attack and normal attack interesting. It would seem that you either specialise or defualt to having a strong overall attack.

The first thing to look at is the dispersion of the pokemon types to see how correlation all the pokemons are given their type.

It would seem that Ghost, Psyhic and Electric pokemon have different characteristics than those of Ground, Fighting and Bug for instance.

Given these factor groups, next it would be interesting to see which of the pokemon had the highest contribution to the construction dimension (Contribution 10). I also want to see the highest quality of representation (cos2>0.6). The cos squared, indicates the contribution of a component to the squared distance of the observation to the origin. i.e cos squared is an important contributor to find the components that are important to interpret both active and supplementary observations, Abdi H, 2010.

Now that we have a clearer understanding of the data, its finally time to conduct the exciting cluster analysis on the data. With the MFA function, this is easily done by plugging the results directly into a hierarchical clustering algorithm. I decided to cut the tree to get 4 clusters in the end. Felt that these represented the different facets of pokemon quite well.

Here we can see a 3D representation of the tree that was build:

I wanted to see the type dispersion among the clusters, perhaps hoping to see a coherent split of types among the clusters…

1

2

3

4

Bug

0

1

3

2

Dark

1

0

0

0

Dragon

0

0

1

0

Electric

2

0

0

0

Fighting

0

0

1

1

Fire

2

0

2

0

Ghost

1

0

0

0

Grass

2

0

3

0

Ground

0

0

1

4

Ice

1

0

0

1

Normal

0

1

5

2

Poison

0

0

4

2

Psychic

2

1

0

0

Rock

1

0

0

2

Water

3

3

4

4

Sum

15

6

24

18

Now comes the interesting bit, dissecting the data to see which pokemon clustered together and why they were thrown into the same cluster.

Cluster 1

Ok, so the pokemon that ended up in cluster 1 was: Voltorb, Oddish, Psyduck, Gastly, Houndour, Bulbasaur, Smoochum, Abra, Slugma, Magnemite, Remoraid, Omanyte, Ponyta, Horsea, Natu. This is quite an odd bunch if you know pokemon well. So, what was it that made them cluster together?

Stat

v.test

Mean.in.category

Overall.mean

p.value

SA.lv1

6.2

77

49

0

SA.lv2

5.9

99

71

0

Total.lv1

2.5

320

297

0.01

Spd.lv1

2

61

52

0.05

HP.lv1

-2

39

46

0.04

Atk.lv1

-2.1

45

54

0.04

Atk.lv2

-2.1

67

79

0.03

HP.lv2

-2.3

61

70

0.02

These Pokemon are specialist in special attacking moves. With the mean attack stat being almost 1/4 higher than the mean of all the other pokemon. These pokemon should be used when there is a type advantage!

Cluster 2

Moving onto cluster 2: Seel, Drowzee, Chinchou, Ledyba, Hoothoot, Tentacool.

Stat

v.test

Mean.in.category

Overall.mean

p.value

SD.lv1

4.8

75

46

0

SD.lv2

4.5

102

70

0

HP.lv2

2.8

89

70

0

Atk.lv2

-2

59

79

0.05

Atk.lv1

-2.4

37

54

0.02

Although these pokemon have defensive capabilities, these capabilities are more pronounced than what we saw in cluster 1 where the individuals had a small significant advantage in a specialized situation.

Cluster 3

This cluster was the biggest with 24 Pokemon ending up in this category. The pokemon that forms part of this cluster is: Ekans, Spinarak, Chikorita, Charmander, Totodile, Doduo, Dratini, Diglett, Spearow, Sentret, Zubat, Magikarp, Weedle, Caterpie, Nidoran(F), Nidoran(M), Meowth, Pidgey, Poliwag, Mankey, Cyndaquil, Hoppip, Squirtle, Bellsprout.

Stat

v.test

Mean.in.category

Overall.mean

p.value

SD.lv1

-2.1

41

46

0.04

Def.lv1

-2.5

42

50

0.01

HP.lv2

-2.5

63

70

0.01

HP.lv1

-2.6

39

46

0.01

SA.lv1

-2.7

40

49

0.01

SD.lv2

-2.8

62

70

0

Def.lv2

-3

64

75

0

SA.lv2

-3.5

58

71

0

Total.lv1

-3.8

272

297

0

Total.lv2

-3.9

396

434

0

lvl_up.lv1

-4.5

20

25

0

It would seem that these pokemon underscore on all statistics and should probably we avoided as strategic investments in your lineup.

Cluster 4

The last group that was identified were Pokemon that are all-rounders. Phanpy, Pineco, Snubbull, Kabuto, Krabby, Machop, Cubone, Grimer, Paras, Swinub, Larvitar, Wooper, Rhyhorn, Sandshrew, Goldeen, Slowpoke, Teddiursa, Koffing – have overall statistics which are higher than average, but does not specialize in any of the type specific advantages that exist in the Pokemon games.

Stat

v.test

Mean.in.category

Overall.mean

p.value

Atk.lv2

4.6

102

79

0

Def.lv2

4.5

96

75

0

Atk.lv1

4.3

70

54

0

Def.lv1

4.3

68

50

0

HP.lv1

3.5

56

46

0

lvl_up.lv1

3.1

30

25

0

HP.lv2

3

81

70

0

Total.lv2

2.1

460

434

0.03

SD.lv1

-2

40

46

0.05

SA.lv1

-2.6

38

49

0.01

Spd.lv2

-3.8

51

70

0

Spd.lv1

-4

35

52

0

Paragons of the clusters

One of the nicest outputs of HCPC is the listed paragons per group. These paragons can be seen as the poster-child representative of each cluster as they lie closest to the center of the cluster:

Cluster 1:

Natu

Psyduck

Houndour

Remoraid

Bulbasaur

0.67

1.2

1.3

1.3

1.4

Cluster 2:

Drowzee

Hoothoot

Seel

Tentacool

Chinchou

0.88

1.1

1.2

1.7

1.7

Cluster 3:

Nidoran(M)

Zubat

Pidgey

Poliwag

Sentret

0.57

0.65

0.71

0.83

0.89

Cluster 4:

Snubbull

Phanpy

Machop

Larvitar

Swinub

1

1.1

1.1

1.1

1.2

The output of the HCPC also allows us to see the individuals which are the most distance from all the other cluster. For brevity, ill leave this table out of this post. These most distant individuals are usually outliers which end up in the group to which it relates (relatively) the closest.

#Conclusion

This post explored cluster analysis, or formally put, a tool in which a multivariate dataset can be explored and eventually be divided into subgroups of similar data based on some kind of proximity estimate. I used Pokemon data as I found it to be an interesting dataset to apply this kind of technique. I must say, that the FactoMineR library helps a lot in facilitating the clustering process once your dataset is in proper format. It takes you through a natural progression of clustering application with a lot of flexibility available to the user to tune the analysis to his/hers needs. I especially like the MFA function where both PCA and MCA can be integrated into a concise function.

In terms of the results, I think it would be interesting to further delve into type combinations within the group in order to have the strongest Pokemon along with a type advantage. But we will leave this for another day…

To leave a comment for the author, please follow the link and comment on their blog: Pareto's Playground.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Show more