Hangingtogether.org

Graph-ted on: Dewey and Wikipedia

2014-02-24

Introduction

Since entering the library world, I’ve been fascinated by the substream of uber-geeks which are the Dewey Decimal Classification (DDC) nerds. It was the enthusiastic and technical-leaning banter of Editor-in-Chief Michael Panzer, that enticed me. When I learned that the DDC is designed to be able to describe all possible concepts, and then also that is not a strict hierarchy, but really a web of complex relations, I was intrigued by the foresight of the approach. I see the DDC as a best effort 100 years before anyone would utter “crowdsourcing,” – which is how a modern analog the Wikipedia Category systems works. Using Graph Theory we develop a new way to look at, and compute with DDC and then connect it to Wikipedia Categories.

(In this post we assume some knowledge of Graph Theory, so if you need a quick primer, it’s at the end: Graph Theory tl;dr.)

Representing the Dewey Decimal Classification in Graph Theory

In order to represent DDC using Graph Theory we have to translate its components into graph theory. That means,

Classifications become the “nodes”.

Relationships between Classifications become the “edges.”

Relation colourings:

Notational hierarchy – Red

See References – Green

Class elsewhere – Blue

We are actually defining a directed graph because the relations we have are not in general reciprocal. That is, if a classification A is below B in hierarchy, or A has a “see also” reference to B, B does not necessarily have that reference to A. (Although there are some reciprocal “class elsewhere” relations for example 391 and 646.3 in the 4 hop graph below). Note also that colourings – the mathematical parlance – indicates that relationships are different without assigning the relationship a numerical value.

Visualizations

This has all been quite abstract so far so let’s look at something concrete. We’ll pick classification 394.1, whose caption is the enjoyable “Eating, drinking; using drugs” . The distance to other nodes is how many edges we travel from our source. We will look at all the connections that are n-hops from 394.1. Below is the evolution of visualizations from distances 2, 3, 4, and 5. By 5 hops you can see that the graph begins to be too complicated to visually represent in a single image.

Weighting the Graph

One of the powerful aspects of rendering the classification system as a graph is that we can determine the shortest distance between two nodes. However, as mentioned, our colourings have no intrinsic value, so in order to to find the shortest paths, we have to assign weights to colours. This is variable, but for the rest of the examples I am using the settings: Notational Hierarchy = 0.5, See Reference = 1, Class_elsewhere = 2. That means that our shortest path between 394.1 and 341.33 (“Diplomatic Law”) is 394.1 -> 394-> 395 -> 327.2 -> 341.33, which has a total value of 5.5.

Transforming Wikipedia Categories

Consider a Wikipedia Category. It can contain many Wikipedia articles, each of which has zero or more citations of books. Some of those book citations include the ISBN, which is precisely what the OCLC Classify API is hungry for. Classify will transform an ISBN or OCLC number into a DDC, based on how it is most popularly cataloged in WorldCat. That means we can arrive at a set of classifications associated with the category. From this set it would be instructive if we could somehow average them into one representative classification. In Graph Theory terms this is known as finding the center, and there are several ways to approach this problem. The method we employ here is the betweenness centrality. That is, we calculate the weighted shortest paths between all the classifications associated with the category, and find the nodes that were most touched in all the shortest paths. This is rather abstruse without some examples. We run the algorithm on some categories and below are the top 5 between nodes of:

Category:Cooking techniques:

641 – Food and Drink

641.5 – Cooking

641.3 – Food

613.2 – Dietetics

641.815 – Bread and bread-like foods

Category:Feminism

305.42 – Social role and status of Women

305.43 – Women by occupation

305.9 – People by occupation and miscellaneous social statuses

305 – Social Groups

305.4 – Women

This shows there is some basis for proving that this technique works. The centre of the classifications we find are around the same topic of that organizes the classification. We usually finds a strand or run in the system that hover around the main ideas. One caveat though is that for both of these Categories only about 25% of Wikipedia articles actually contain an ISBN citation. Yet of articles that do include at least one ISBN citation, they average 2.7 citations per article. With that in mind we continue to look at some set of articles that are not organized by topic, and are also more ISBN-rich. What better category than Featured Articles?

Category:Featured articles

930 – History of the ancient world c. 499

940 – History of Europe

361 – Social problems and welfare in general

800 – Literature and Rhetoric

362 – Social welfare problems & services

580 – Plants (Botany)

Featured articles, of which there were ~4,000 gave lead to 25,000 ISBNs, which related to 17,000 Classifications. There are only approximately 50,000 standard Classifications. This at first would seem impressive coverage, but distribution of the coverage in the graph was not calculated, so its still difficult to say how diverse the featured article set is. Computing on this set was very intensive, in fact just computing the centre of these 17,000 Classifications took 9 hours parallelized over 16 cores. These topics seem to be more abstract, and less interrelated. Without being able to infer some sort of connect, next we run the algorithm not on a category, but on the list of articles that your blogger has ever edited (307 articles).

Contributions of User:Maximilianklein

930 – History of the ancient world c. 499

800 – Literature and Rhetoric

302.23 - Media (Means of communication)

910.45 – Ocean travel and seafaring adventures

361 – Social problems and welfare in general

Uncovered here is a latent passion for Seafaring adventures! This is probably owing to the fact that travel articles may have cited ISBN-having travel books which were disproportionately focused around sea-travel. More importantly what arises here is that several of the central Classifications that describe the edit history are similar to the featured articles results. This would seem to suggest that there are some very central nodes that most connections are channelled through if they lie at disparate regions of the graph. The implication is that the problem of classifying an arbitrary group of object remains difficult! Of course these results are only for one possible set of values for our weights dictionary, where these top level channels would be more “expensive” to hop-through. So future research should try and calibrate these values to produce the most varied set of central nodes.

This work is available as an Ipython notebook and on github, and comments welcome

–notconfusing

Graph Theory? tl;dr

We are not talking about the colloquial meaning of “graph”, but rather the mathematical concept.

A Graph is a collection of:

“Nodes” or “Vertices” – the objects

Here letters A through F

“Edges” – relations (directed or undirected)

Here the arrows between letters

Not talking about this kind of graph. (Image By Inductiveload [Public domain], via Wikimedia Commons)

An example of a “Graph Theory” Graph (By David W. at de.wikipedia (Original text : David W.) [Public domain], from Wikimedia Commons)