Using an appropriate network representation and the right tool set are the key factors in successfully merging structured and time-series data for analysis.
In Part 1 of this series, you took your first steps for using Apache Giraph, the highly scalable graph-processing system, alongside Apache Hadoop. In this installment, you’ll explore a general use case analyzing time-dependent, Big Data graphs using data from multiple sources. You’ll learn how to generate random large graphs and small-world networks using Giraph – as well as play with several parameters to probe the limits of your cluster.
As you’ll see, using an appropriate network representation, and the right tools for the job (such as Gephi for visualization, a Hadoop-Gephi connector, and CDH), can make this process much easier. (Note: Although neither Giraph nor Gephi are currently supported/shipping inside CDH, you may find this exercise interesting for learning, testing, and development purposes.)
The Use Case
Since Hadoop is a relatively recent innovation, attracting more and more people’s attention and even a lot of money, I was interested in the question: What can Wikipedia tell us about the emerging Hadoop market?
Surprisingly, no single Wikipedia page on this topic exists. A quick search for the terms “Hadoop” and “market”, however, shows 49 results. (That is, 49 Wikipedia pages contain both words.) I did not consider non-English pages, which introduced a strong and not quantified bias. Nobody is equipped to analyze all topic pages in every language, so I had to define a purely data-driven approach.
The solution is simple: Wikipedia pages, like other webpages, can easily be ranked with the classic PageRank algorithm, which is implemented in Giraph. But to do that, we would need the full page graph for every page in Wikipedia – and that is definitely not an option. This raises the question: What is a large graph? Or rather, how large are “large graphs”?
Large Graphs and Efficient Representation
To quantify the size of a graph which is represented in a format, suitable for Apache Giraph, one either specifies the required amount of memory that is needed to store the graph in RAM or disk (which is interesting for long-term storage), or counts the number of nodes and edges. Either approach will reveal something about possible limitations, whether on one machine or on a cluster, because in all cases resources are limited. Let’s look at two examples:
Wikipedia has about 30,000,000 pages (as of January 2014), and if the average number of links per page is assumed to be around eight, which was (according to S. N. Sorogovstev, p.8) the estimated number of hyperlinks per page in the WWW in 1999, we would need around 230MB to store each link as a byte in the adjacency matrix. To track the dynamics of such a network over five years based on daily snapshots of the network links, one would need around 407GB, but that’s without content – no words, images, edit history, or Java objects. Instead, we could calculate a correlation matrix — which contains the correlation link strength for all node pairs — from multiple node properties, but that would require around 410TB per snapshot (and 145PB for a full year of them!).
The social graph used by Facebook has about 1 billion input vectors with 100 features. Facebook claims, among other things, that its “performance and scalability on a 1-trillion edge social graph is two orders of magnitude beyond that scale of other public benchmarks.”
These massive numbers are cause for concern with respect to representation — especially in Hadoop, where data is distributed across many nodes. Whether the graph is stored as a file in Apache Hive, or even in Apache HBase tables, this consideration is critical because not all file formats can be split.
Usually a graph is presumed to be a matrix. If a link has more than one property or the nodes have more than one interaction mode (or multiple connectivity), tensor representation is necessary. Sometimes the nodes are of different types; such networks are called n-partite networks, and if multiple link types between the nodes are possible, multiplex networks – in which each individual link type defines a single network layer. Another option is a “hyper-graph” containing hyper-edges, which are characterized by the three or more nodes they tie together.
Other appropriate representations — like adjacency lists, or node and edge lists – are available. Such graph representations differ from the fundamental matrix representation. They often have a more efficient memory footprint; compressed link lists, for example, do not contain any values for non-existent links. In the case of a matrix, even a link strength of zero would require a stored value. Node identifiers can efficiently be encoded, for example with Huffman coding. Instead of the complete, sometimes quite long, node name like a URI, which is often used in semantic graphs only a number is used during computation, which also lowers memory requirements. In other cases, like in a tabular representation of a vertex cut, which is used in GraphX or Apache Spark, the data model contains additional information about the neighborhood of a node.
This brings us back to our example. We want to analyze how information about the Hadoop market is covered or represented in Wikipedia. Instead of working with the full page graph, I downloaded the neighborhood graphs, shown in Figure 1, wherein a) all languages are presented in an individual color, and b) highlights the local neighborhood. This means that all pages in the same language, like the initial page in English, are shown in group LN (green), while all remaining pages in all other languages are shown in the global neighborhood GN (blue).
Figure 1. Preparation of interconnected local networks for multilingual Wikipedia analysis.
The dark-colored nodes have a special meaning: The green one is our initial page, called central node CN. This node has so-called inter-wiki links to nodes in other languages, which describe the same semantic concept. All dark nodes form the core and all nodes in light colors are the “hull” of the local network.
l personally like the comparison with the model of an atom. Just as multiple atoms form molecules, several such local networks form a neighborhood network.
Such an intermediate result is visualized with Gephi in Figure 2. The neighborhood network for “Hadoop market” consists of manually selected pages (a set of 26 central nodes) and also includes their local and global neighborhoods, which means all direct linked pages in all Wikipedia languages are available.
Figure 2. The local neighborhood network for Wikipedia pages, which represents the Hadoop market, shows three major cluster: the green nodes on the top are all pages about open-source software projects. Companies in the Hadoop market are shown in the middle, and a cluster that contains more fundamental topics, like Java programming, is shown at the bottom.
This approach has two major advantages:
We can reduce the amount of data we have to handle without losing the embedding of the pages, and
We can easily separate and compare sub-data sets by language for language-dependent analysis.
But we have to be careful: a PageRank computation on this graph would give a different result compared to the result obtained from a full Wikipedia link graph. Comparing the global (full page graph) and local ranking (combined neighborhood graph) would tell us how the extraction of the local neighborhood network influences the results. This process is similar to an external disturbance of the system — one can expect that the absolute values of each node PageRank will differ in both networks, but the ranked list of PageRanks should not differ much (otherwise, the results may not be meaningful).
Now we have an approach that allows efficient extraction of sub-data sets. (Network nodes like Wikipedia pages have many different facets.) We can measure the amount of information on each page (text volume or number of words or sentences), the number of images, or links to other pages. Even a more dynamic view is possible; we just have to create correlation networks from user activity data, like access time series or edit history. All this data should be part of the data set, even if it is not used every time in every computation.
Visualization of Time-Dependent Multilayer Graphs
Before we get into more Giraph-related coding details, we need to add some tools to our graph-analysis workbench for quickly analyzing intermediate results and handling multilayer graphs. (Fortunately, such multilayer networks can be stored in Hive tables using partitions for each layer or for each temporal snapshot.)
We need multiple tables or multiple queries to extract edge and node data separately from one table. One sub-data set contains all the node properties of a neighborhood network. The links are also organized in multiple sub-data sets, one per link type. We can also store all links in just one table, but in this case each link type needs a classifier for filtering the list. (In the case of time-dependent networks, I recommend partitioned tables – this makes loading data into Gephi much faster.)
Figure 3. Gephi visualization of a correlation network.
Both features — semantic graph metadata management and “one-click” import from Hadoop — are implemented in the Gephi-Hadoop-Connector (see Figure 4). The Hive Metastore already stores some details about the data in our tables, like the table name, column names and types, the SerDe with its relevant parameters, and partitions. But in many research projects, one has to track yet more data that describes the meaning of a given table or even a column. Either for selection of filter criteria or for consistency checks, such information is highly relevant.
Figure 4. Metadata management for multiple time-dependent network layers is done via the Etosha plugin, which allows a comprehensive description of each single layer.
Here, we will store all facts about a data set as a page in a Semantic Media Wiki (SMW). This page allows human interaction, collaboration, and even machine access in parallel. Semantic annotations tell the Gephi-Hadoop-Connector all details about the multilayer network, such as which layer belongs to what network and which Hive or Impala query is required to load this data set from Hadoop. This information is retrieved via the query API of the SMW. (A future version of the Etosha Semantic Metastore, or ESM, will work with any type of triple store, which allows a more generic approach based on SPARQL queries.)
Our data-driven market study starts with an extraction of sub-data sets, which are defined by the local neighborhood graphs for a set of manually selected central nodes. We extract the access-rate and edit-event time series for those pages and calculate some characteristic values, like the local relevance index, based on access-rate data. This is shown in Figure 5, and the global relevance index for each page is shown in Figure 6. (Some more details about the analysis procedure are shown in this conference poster, presented during the Spring 2014 meeting of the DPG in Germany.)
Figure 5. Hadoop continuously attracts measurably more attention since 2009 until its “public” breakthrough in 2011 in English.
Figure 6. The increase of attention, people give to Hadoop related topics strongly depends on the language. In non-English Wikipedia projects, the Hadoop market is not that well recognized as in the English Wikipedia, but the increasing trend is clearly visible for Apache Solr (orange) and Hadoop (violet).
To summarize, our graph-analysis toolbox now contains:
Giraph 1.0
A CDH 4 development cluster in pseudo-distributed mode. We store data in Hive tables and the JDBC connector allows fast data access via Impala.
Gephi for visualization, on an external client machine. With the Gephi toolkit library, we can load and manipulate any graph file without showing a GUI. If we have to process a very large number of relatively small graphs, we can use this library within a map-only MapReduce job, but for large-scale processing, we will use Giraph.
Hands-On
Now, we can focus on practical things. First, we have to prepare the analysis workbench and then generate test data sets.
Two InputFormats for random graph generation are available in Giraph. Let’s calculate the PageRank for same random sample graphs.
Build and Deploy Giraph 1.1.0
1. Clone giraphl from Github into directory /home/cloudera/GIRAPH.
2. Check and if necessary modify settings:
3. Run the “bootstrap giraph script” in ./giraphl/bin/bsg.sh:
4. The build procedure will take some time. After about three to 10 minutes you should see the result:
5. Start a Giraph benchmark with the following command:
and it should give a comparable output:
Congratulations, you have successfully built Giraph (branch: trunk) and tested it for CDH!
Generating Random Graphs with Giraph
Graph classification can be based on measurable properties. “How ordered or how random is the graph structure?” are important questions in graph analysis, and many researchers like to investigate the degree-distribution of any given graphs. Two networks with very different properties are shown in Figure 8 together with the corresponding degree distribution.
Giraph calculates several properties of a given graph, which is loaded before the computation starts, but the graph can be generated on the fly as well. Table 1 shows two VertexInputFormats that are used for random graph generation.
Table 1. VertexInputFormats are used as “on the fly” graph-generators.
Figure 8. Small-world network generated with WattsStrogatzVertexInputFormat
Figure 9. Random graph generated with PseudoRandomVertexInputFormat
Analysis of such a generated graph is done with one of many algorithms, and if the graph should only be stored, use the IdentityComputation with an appropriate EdgeOutputFormat (such as the GraphvizOutputFormat).
Because of a limitation in Gephi, it is necessary to install graphviz to plot graphs defined in DOT format. (OmniGraffle is another option that allows import of a subset of DOT data.) In order to use the generated output in graphviz, the graphivz tools have to be installed. In CentOS this is done via:
The following two examples illustrate how the graphs in Figures 8 and 9 can be created with Giraph. Both commands are available as functions in the giraphl bsg.sh script. For Giraph library paths, the variables GEX and GEC are defined as follows:
Generate and calculate PageRank for a random network, store it in DOT format:
Generate and calculate PageRank for a small-world network, store it in DOT format:
Conclusion
Now you have learned how to manage time-dependent multilayer networks in Hadoop using Giraph and other tools
To conclude this series we will compute the correlation matrix from access-rate time series
using Apache Crunch and then we calculate the PageRank with Giraph and GraphLab to compare results for
static networks, correlation networks and dependency networks created by two concurring
open source graph processing tools, which integrate well into the Hadoop ecosystem.
See you soon, and have fun with Giraph!
Mirko Kämpf is the lead instructor for the Cloudera Administrator Training for Apache Hadoop for Cloudera University.