2016-04-19

<<Mandatory Spoiler Warning>>

Information from the first 5 books is discussed, which corresponds to the first 5 TV seasons with some variations (the TV series is more advanced but omits a number of secondary characters for clarity and budgetary reasons).

<<End of Warning>>

Does fiction really have to make sense? This blog explores the application of data science to storytelling through the lens of G.R.R. Martin’s epic series (~1.4 million words). Game of Thrones (GoT) has, in part, become famous because any character is “fair game” and can die at any moment, regardless of how prominent s/he was up to that point. Are those deaths the result of “whims of the author”, or are they part of a longer constructed narrative?

To answer these questions, we turned to network graphs. Starting with raw text data, we:

Automatically created a social network of characters for any time period of the story

Analysed the structure and storylines of individual books through network relations

Measured the importance and position of characters throughout the story based on their network centrality and graph visualisations

Accurately predicted major character deaths with a Loopy Belief Propagation (LBP) algorithm over the social network

All of the graph-based findings required only the text data and a list of character names. No “expert” information about the books or story was needed. In spite of this apparent simplicity, the created networks faithfully represent the story, and major character deaths are accurately predicted.

Why the books?

Both books and TV series require about 50 hours to read/watch and contain loosely[2] the same information. Text, however, is much easier to analyse and manage than video. Moreover, the books correspond to G.R.R. Martin’s original story, while adaptations always rewrite the material to some extent.

First, we loaded e-book files into an Aster database and separated them into chapters, preserving the entire text content. Because GoT is written as a series of “Point of View” (PoV) chapters, i.e., each chapter is told from a character’s perspective, we can estimate whose story is effectively being told over the books by counting the number of words devoted to the perspective of each PoV character.
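As a rough illustration of this word count, here is a minimal Python sketch. It assumes the chapters are already available as (PoV name, chapter text) pairs; the function names and the toy chapters are hypothetical and are not the Aster queries used for the actual analysis.

```python
from collections import Counter


def pov_word_counts(chapters):
    """Sum the number of words told from each PoV character's perspective.

    `chapters` is an iterable of (pov_name, chapter_text) pairs.
    """
    totals = Counter()
    for pov, text in chapters:
        # Whitespace tokenisation is a simplification (an assumption).
        totals[pov] += len(text.split())
    return totals


def pov_shares(totals):
    """Convert raw word counts into each PoV character's share of the text."""
    grand_total = sum(totals.values())
    return {pov: count / grand_total for pov, count in totals.items()}


# Toy stand-in chapters (not actual book text).
chapters = [
    ("Eddard", "the king rode north with his court"),
    ("Arya", "she practised with her sword"),
    ("Eddard", "he read the old book again"),
]
totals = pov_word_counts(chapters)
shares = pov_shares(totals)
for pov, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{pov}: {totals[pov]} words ({share:.0%})")
```

Computing the shares per book rather than over the whole series would give the kind of book-by-book comparison discussed next.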



Despite its simplicity, word count yields valuable insights on the story structure:

The first two books are told from the Starks’ perspective (Tyrion and Daenerys, two outcasts, are the others), naturally leading to the “Stark = good, Lannister = evil” perception of the story.

The share of the story told by “second-tier” PoV characters (4th onwards) steadily decreases over time as the story becomes more complex and new PoV characters are introduced.

While only four[3] PoV characters are definitely dead by this point[4], Eddard Stark’s death comes as a surprise because he is the main storyteller of the first book.

Beyond word counts: creating a social network from text

GoT is a long and complicated story, with a lot of characters. Perhaps counter-intuitively, this complexity is an advantage: instead of constant pronoun references, GoT usually spells out full character names[5]. Moreover, the PoV character name itself is frequently mentioned in its respective chapters[6].

We created social networks from text information by finding mentions of 102 characters[7] throughout the books. This includes common nicknames from the book appendixes, e.g., Kingslayer, Khaleesi.

In each chapter, we counted the number of occurrences of all character names to create a link between the PoV character and the characters whose names occur in the chapter. For example, in an “Arya” PoV chapter, if ‘Arya’ is mentioned 20 times, ‘Sansa’ 12 times and ‘Jon’ 11 times, we would have (refer to table):



Above: Output of the character count algorithm. Each mention of the 102 “tracked characters” is counted for each chapter. The R-value and last row (Sansa-Jon) are not directly obtained by text parsing but in a subsequent step described below.
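The counting step can be sketched in a few lines of Python. This is only an illustration: the alias table below is a hypothetical four-character excerpt (the real list tracks 102 characters and their nicknames), and matching on whole words is an assumption about how names were detected.

```python
import re
from collections import Counter

# Hypothetical alias table: canonical name -> names and nicknames to search for.
ALIASES = {
    "Arya": ["Arya"],
    "Sansa": ["Sansa"],
    "Jon": ["Jon"],
    "Jaime": ["Jaime", "Kingslayer"],
}


def count_mentions(text, aliases=ALIASES):
    """Count whole-word mentions of each tracked character in one chapter."""
    counts = Counter()
    for character, names in aliases.items():
        pattern = r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b"
        found = len(re.findall(pattern, text))
        if found:
            counts[character] = found
    return counts


# Toy chapter text (not actual book text).
chapter_text = (
    "Arya missed Jon. Sansa did not. "
    "Arya thought of the Kingslayer and of Jon."
)
counts = count_mentions(chapter_text)
for character, mentions in counts.most_common():
    print("Arya (PoV)", character, mentions)
```

A first-name pattern such as "Jon" would also match "Jon Arryn", so a real implementation needs some disambiguation that this sketch leaves out.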

This methodology naturally produces a “spoke” network since relationships between non-PoV characters are not captured. We derive “cross-character” relations by:

1) Normalizing character mentions (per chapter) to create a mention value R

2) Calculating the R between 2 non-PoV characters with a Euclidean-type distance measure[8]

The output of this process is shown in the table above. For any single chapter, the relationship strength (R) between two non-PoV characters cannot exceed their relationship strength with the PoV character.
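To make the two steps concrete, here is a small Python sketch using the Arya example above. Two things are assumptions on my part: the normalization (dividing by the PoV character's own mention count, capped at 1) is not spelled out in the text, and negative results of the formula in footnote [8] are clamped to 0.

```python
import math


def normalise(counts, pov):
    """Mention value R of each non-PoV character relative to the PoV character.

    Assumption: R = mentions / PoV mentions, capped at 1.
    """
    pov_count = counts[pov]
    return {
        name: min(mentions / pov_count, 1.0)
        for name, mentions in counts.items()
        if name != pov
    }


def cross_strength(r_a, r_b):
    """Relationship strength between two non-PoV characters, as in footnote [8]:
    1 - sqrt((1 - R(A))^2 + (1 - R(B))^2).
    Clamping negative values to 0 is an assumption.
    """
    return max(0.0, 1.0 - math.hypot(1.0 - r_a, 1.0 - r_b))


counts = {"Arya": 20, "Sansa": 12, "Jon": 11}
r = normalise(counts, "Arya")
r_sansa_jon = cross_strength(r["Sansa"], r["Jon"])
print(r, round(r_sansa_jon, 3))
```

Because the square root is never smaller than either of its two terms taken alone, the cross-character value stays at or below both R(A) and R(B), which is the bound stated above.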

Measuring relationships between characters on a chapter basis, we can create a social network for any single chapter, or between any two points in time in the story, by adding up the R values for the considered chapters. We visualised the graphs through Aster App Center, with the following properties:

Node color corresponds to the output of the modularity algorithm (graph-based clustering of social communities)

Node size represents the number of mentions of a character

Edge thickness and color represent the relative strength of connections between characters (thicker and redder = stronger)
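The aggregation behind these graphs, adding up R values over a range of chapters, could look like the following Python sketch. The data layout (one dictionary of edges per chapter) and the R values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_network(chapter_edges, start, end):
    """Sum R over chapters start..end (inclusive) into undirected edge weights.

    `chapter_edges` is a list with one {(a, b): R} dictionary per chapter.
    """
    weights = defaultdict(float)
    for index, edges in enumerate(chapter_edges):
        if start <= index <= end:
            for (a, b), r in edges.items():
                # Sort the pair so (a, b) and (b, a) share one key.
                weights[tuple(sorted((a, b)))] += r
    return dict(weights)


# Toy per-chapter R values (not taken from the books).
chapter_edges = [
    {("Arya", "Sansa"): 0.6, ("Arya", "Jon"): 0.55},
    {("Eddard", "Robert"): 0.9},
    {("Sansa", "Arya"): 0.3, ("Sansa", "Joffrey"): 0.7},
]
whole_story = build_network(chapter_edges, 0, 2)
first_two = build_network(chapter_edges, 0, 1)
print(whole_story)
```

Choosing `start` and `end` at book boundaries gives the per-book graphs, and a range starting at the first chapter gives the cumulative graph.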

The individual book (TV season) is arguably the most natural scale to portion the story; we generated graph visualisations for each book (non-cumulative information) and one covering the entire story (all books from page 1).

Book 1 leads to a large purple cluster centered on the events at King’s Landing with strong links between Eddard, King Robert and the Small Council. The large red cluster concerns the events in the North and the Riverlands, with a strongly connected Stark family and Tyrion’s capture later in the book. Jon’s move to the Wall gives him his own cyan cluster, while the events across the sea are correctly represented with the small but strongly connected green cluster of the Targaryens/Dothraki.



Above: Social Network Graph of “A Game of Thrones”: the first book of “A Song of Ice and Fire” Series. The clusters and structure of the network adhere well to the story despite having been obtained automatically from the books’ raw text. The stronger relationships can be seen between Eddard and Robert, Eddard and Catelyn, Daenerys and Khal Drogo.

The second book’s main story relates events in King’s Landing (cyan) with Tyrion, Sansa, and Joffrey as well as the “war of the kings” involving the Starks and Baratheons (yellow). A number of smaller individual clusters appear around Bran in Winterfell (red) and Arya’s (green) escape, detached from the “main story”. The Wall (purple) and Slaver’s Bay (blue) still have their individual, fairly independent clusters.

Above: Social Network Graph of “A Clash of Kings”: the second book of “A Song of Ice and Fire” Series. The stronger relationships can be seen between Joffrey-Sansa-Tyrion, Renly-Stannis. We see new strong relationships developing between Jon-Qhorin as well as Daenerys and Jorah. Compared to GoT, the second book has a more complex structure, with two main stories (King’s Landing and the Riverlands) flanked by smaller detached clusters that follow central PoV characters.

The graph-based progression of the books illustrates what most readers know[9]: more characters and sub-stories appear (Dorne and the Iron Islanders in Book 4, Quentyn Martell in Book 5). The corresponding social networks take the shape of smaller inter-relationships centered on PoV characters that have less and less contact with each other. This structure is particularly apparent in books 4 and 5 despite the intertwined timeline (Book 4 and 5 are split geographically rather than temporally)[10].

Above: Social Network Graph of “A Storm of Swords”: the third book of “A Song of Ice and Fire” Series. The book continues the evolution towards smaller, more separated communities, with Arya and Bran having their own storylines that intersect little with the smaller main story (Robb-Catelyn-Walder “Red Wedding” and Joffrey-Sansa-Tyrion “Purple Wedding”). A notable change is the growth of the Wall cluster with the arrival of Stannis in the North (pink).

Above: Social Network Graph of “A Feast for Crows”: the fourth book of “A Song of Ice and Fire” Series. The network takes a very distinct shape, with almost isolated “spoke” communities and the appearance of new characters (Dorne in light blue and the Iron Islanders in red). The book is about Cersei and Jaime, whereas Daenerys and the Wall are not part of book 4.

Above: Social Network Graph of “A Dance with Dragons”: the fifth book of “A Song of Ice and Fire” Series. The final book of the series so far continues to illustrate the change in Network Structure, with fewer links overall and a more central structure around its main characters. Tyrion has left Westeros and now appears in Slaver’s Bay. The proliferation of “side stories” and the introduction of main characters have stretched the network into a series of individuals.

Measuring Character Importance with Graphs

A network graph structure for the story enables us to understand and measure the relative importance of characters better than through word count alone, notably with the centrality measures of ‘betweenness’ (how much information transits through a node) and ‘closeness’ (how central a node is within its local network).

Characters that travel or act as gateways to their communities score high on betweenness, while characters central to their (large) communities have a high closeness. Characters like Jon or Tyrion link other characters (and the story) together, while characters with high closeness, e.g., Catelyn or Sansa, form strong relationships within their network. Tyrion scores high in both, making him the overall central character of the story so far.
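For readers who want to see how the two measures behave, here is a self-contained Python sketch on a toy graph. It treats the graph as unweighted, which is a simplification: the analysis in this post uses weighted relationships, and the exact centrality variants it used are not stated. Betweenness is computed with Brandes' algorithm, and closeness as the inverse of the mean shortest-path distance. The five-character graph is invented for illustration.

```python
from collections import deque


def closeness(graph):
    """Closeness: (reachable nodes) / (sum of shortest-path distances)."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


def betweenness(graph):
    """Brandes' algorithm for unweighted, undirected graphs (unnormalised)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:  # back-propagate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: value / 2 for v, value in score.items()}


# Toy graph: a King's Landing triangle joined to two Starks through Tyrion.
graph = {
    "Cersei": {"Joffrey", "Tyrion"},
    "Joffrey": {"Cersei", "Tyrion"},
    "Tyrion": {"Cersei", "Joffrey", "Catelyn"},
    "Catelyn": {"Tyrion", "Robb"},
    "Robb": {"Catelyn"},
}
bet = betweenness(graph)
close = closeness(graph)
for name in graph:
    print(name, bet[name], round(close[name], 3))
```

In this toy graph Tyrion sits on every path between the two groups, so he has the highest betweenness, and he is also the node closest to everyone else on average.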

Above: Betweenness and Closeness values for characters across the books. Book 1 is about Eddard, Book 2 about Tyrion and Joffrey, Book 3 is the best balanced. Book 4 is Lannister-centric, while the last is all about the North. The increase of betweenness in book 5 is a direct result of a more decentralized network (see corresponding graphs).

Above: The story so far: Social network graph over the entire 5 books. A Song of Ice and Fire is the story of the Lannisters (Red), the Starks (green, dark blue, purple) and a Woman with dragons (Gold). From a network structure perspective, Tyrion and Jon are the heroes of the story. Looking at dead characters so far, we note they are all either centrally located within their network or are leaf nodes (i.e., few ‘hub’ characters are killed)[11].

Eddard’s death: a surprise?

Being able to calculate social networks at any point of the story allows us to monitor the state of the social network up to the point of major deaths (i.e. the chapter before the death occurs) and determine whether the character’s death was a surprise or was foreordained by the network structure and previous deaths.

To predict character deaths, we employ a Loopy Belief Propagation (LBP) algorithm, which propagates ground truth beliefs of a character status (i.e. true information, with 0 = alive, 1 = dead) throughout the graph, according to edge weights[12].

“Dead” ground truth values are straightforward to obtain (a character is assigned a ground truth of “1” upon his/her passing). On the other hand, “Alive” ground truth is harder to define in a series notorious for its high death rate. We can, however, use LBP to show that major character deaths can be predicted from social networks only and test our initial “Alive” hypothesis at the same time (i.e., if our ground truth leads to accurate predictions then it is believable).

Our prediction rules are:

Jon Arryn, Viserys and Robert Baratheon are “dead” (i.e., we do not make predictions before Robert’s death because there is not enough information)

When characters die (major or not) we add them to the “dead” ground truth

Daenerys and Jon Snow are alive[13] (our only “alive” ground truth)

The ground truth “beliefs” (dead or alive) are propagated along the network with the R value calculated earlier[14]. Importantly, LBP doesn’t use any “future information” or learnings to calculate beliefs (apart from the ‘alive’ ground truth); only the structure of the social network at the time of the event is used.
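To give a feel for how such a propagation works, here is a minimal sum-product belief propagation sketch in Python for a two-state (alive/dead) model. It is not the Aster implementation: the edge potentials, the `coupling` parameter, the 0.99/0.01 priors standing in for ground truth, and the toy edge weights are all assumptions made for illustration.

```python
def loopy_bp(edges, priors, coupling=0.4, iterations=50):
    """Sum-product belief propagation on a binary (alive/dead) pairwise model.

    edges:  {(a, b): weight in [0, 1]}, undirected relationship strengths
    priors: {node: P(dead)} for ground-truth nodes; others default to 0.5
    Returns {node: belief that the node is dead}.
    """
    neighbours = {}
    for (a, b), w in edges.items():
        neighbours.setdefault(a, {})[b] = w
        neighbours.setdefault(b, {})[a] = w

    def phi(node):
        """Node potential as (alive, dead)."""
        p = priors.get(node, 0.5)
        return (1.0 - p, p)

    def psi(w, xi, xj):
        """Edge potential: stronger ties favour equal states (assumed form)."""
        return 0.5 + coupling * w if xi == xj else 0.5 - coupling * w

    # messages[(i, j)] is the message from i to j over j's two states.
    messages = {(i, j): (0.5, 0.5) for i in neighbours for j in neighbours[i]}
    for _ in range(iterations):
        updated = {}
        for (i, j) in messages:
            w = neighbours[i][j]
            out = []
            for xj in (0, 1):
                total = 0.0
                for xi in (0, 1):
                    incoming = 1.0
                    for k in neighbours[i]:
                        if k != j:
                            incoming *= messages[(k, i)][xi]
                    total += phi(i)[xi] * psi(w, xi, xj) * incoming
                out.append(total)
            norm = out[0] + out[1]
            updated[(i, j)] = (out[0] / norm, out[1] / norm)
        messages = updated

    beliefs = {}
    for i in neighbours:
        alive, dead = phi(i)
        for k in neighbours[i]:
            alive *= messages[(k, i)][0]
            dead *= messages[(k, i)][1]
        beliefs[i] = dead / (alive + dead)
    return beliefs


# Toy network with invented weights (not the values computed from the books).
edges = {
    ("Robert", "Eddard"): 0.9,
    ("Eddard", "Catelyn"): 0.5,
    ("Eddard", "Jon Snow"): 0.3,
    ("Jon Snow", "Sam"): 0.7,
    ("Daenerys", "Jorah"): 0.8,
}
priors = {"Robert": 0.99, "Daenerys": 0.01, "Jon Snow": 0.01}
beliefs = loopy_bp(edges, priors)
unknown = [name for name in beliefs if name not in priors]
ranking = sorted(unknown, key=beliefs.get, reverse=True)
print(ranking)
```

In this toy network, the character most strongly tied to a dead ground-truth node comes out on top of the ranking, while characters tied mainly to the “alive” ground truth end up at the bottom. As with the results below, the ordering is more informative than the numbers themselves.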

Who’s Next? LBP death prediction results

Death risk (top ranked characters) before Eddard Stark’s execution

Above: The “probability of death” (2nd column of the above table) should not be taken literally. We do not have nearly enough ground truth data (or data altogether) for the value to be precise; the order is more meaningful here.

At that point in the story (the chapter before his death), Eddard is the most likely person to die. Note that Khal Drogo (who dies at about the same time) is second on the list.

Arguably, Eddard was central to all network measures, so this result could be due to “the most central character is marked as the likeliest to die”. In reality, the state of the social network at the time of Eddard’s death indicates that it is his close proximity to King Robert that raises his risk.

Following on, we have:

Death risk (top ranked characters) before Renly Baratheon’s Murder

Above: After Eddard and Khal Drogo, Renly is the next major[15] character death. Moreover, he is not a particularly central character: low betweenness, and no PoV chapter.  The LBP algorithm accurately picks him as the likeliest character to die.

Death risk (top ranked characters) before the “Red Wedding”.

The chapter (and TV episode) labeled as the “Red Wedding” sees the whole Stark host and its allies murdered, most notably Robb and Catelyn Stark.

LBP’s prediction continues to be reliable: the “likeliest to die” character is determined to be Catelyn Stark, while 5 of the 10 people with the highest likelihoods are part of the “Red Wedding” itself. Catelyn scores significantly above her son Robb due to her closer involvement with Stannis and Renly (in the second book). At that point in the books, the relationship between Robb and Catelyn is the strongest between any two characters, i.e., when one dies, the other one will instantly top the risk list.

The accuracy of LBP over the first 3 books is remarkable considering the limited information it uses[16]. In fact, only Joffrey’s death is missed out (he ranks 7th out of ~80 characters). In the books there is little time between the two weddings (10 chapters or so); without new ground truth or relationships, the LBP predictions are heavily skewed towards the survivors of the “Red Wedding”.

From the 4th book onwards, however, the accuracy of LBP does go down for three reasons:

The number of dead people keeps on growing while additional “survivors” are hard to pin down without using 20/20 hindsight, so most characters are predicted to die

The story becomes less focused, with a “hub and spoke” model[17]; propagating beliefs along this type of graph can be inaccurate

Books 4 and 5 occur “at the same time”[18], making propagation prediction more difficult.

Is Jon Snow Dead?

Despite its simplicity and limitations, LBP provides amazingly accurate information.

By the end of book 5 the “timeline” is back to normal, allowing us to predict who dies in book 6 based on the state of the social network over the entire story.

However, we had to make a decision: Jon Snow was labeled with “alive” status for the analysis. LBP therefore predicted that he is alive. If we removed his ground truth status, most characters would be predicted to die (since only Daenerys would remain as “alive”).

We therefore ran the algorithm under two distinct hypotheses:

Label Jon Snow ‘unknown’ and Tyrion as ‘alive’ to predict who’s next to die under the assumption that Jon Snow is “fair game”

Keep Jon ‘alive’ (our standing hypothesis).

Book 6 Prediction

Above: The numbers are close to each other, which is expected considering the low number of ‘alive’ characters. Relationships are built over 5 books though; small differences are more meaningful than in earlier measures.

Conclusion

If Jon Snow isn’t manually kept alive, he is dead[19]. If Jon is alive, then Brienne, Walder (Frey), Sam, Edmure and the Lannisters top the list[20].

Do not miss!!!!

Read the second part in the ‘Predicting Deaths in Game of Thrones’ blog series tomorrow on Event-based Survival Analysis and Conclusion.

————————————————————————

Footnotes

* While the book is called “A Song of Ice and Fire”, most people know it as Game of Thrones

[2] Books have more detailed content and more side-stories

[3] Catelyn (no lady Stoneheart shenanigans here), Eddard, Arys Oakheart and Quentyn Martell

[4] In fact being a PoV is the second most accurate single feature to predict survival

[5] Necessary when so many characters are mentioned in a single chapter

[6] In fact, it is the most frequently mentioned name in all but a couple of chapters

[7] From http://iceandfire.wikia.com

[8] Specifically, if two characters A and B have a connection strength with the PoV character of R(A) and R(B), respectively, we calculate the strength of the relationship A-B as 1 – sqrt( (1-R(A)) * (1-R(A)) + (1-R(B)) * (1-R(B)) )

[9] And most physicists as well: entropy always increases

[10] The TV series has omitted or shortened a number of side characters and stories and kept a unified timeline; it is therefore “more coherent”

[11] Which bodes well for Jon Snow

[12] This algorithm has previously been used on historical Renaissance data here: http://blogs.teradata.com/international/a-crazy-belief-predicting-outcomes-from-network-graphs/

[13] It is “A song of ICE and FIRE” after all…

[14] I have calculated modifications on the R value depending on the characters being from friendly/enemy houses or allegiance relationships. It changed the numbers somewhat but not the ordering of the “characters likely to die/survive”

[15] With a significant role in the TV series

[16] Does G.R.R. Martin know more about graph theory than he lets on?

[17] See the network graphs for book 4 and 5

[18] The TV series “fixes” this

[19] That may turn out to be literally true

[20] As a gratuitous hedge, these predictions are for the books; the TV series may turn out to be different


The post Who’s Next? Predicting Deaths in Game of Thrones* – Part 1: From books to social networks appeared first on International Blog.
