Last month I attended the CeBIT trade fair in Hannover. Besides the so-called “share economy” there was another main topic across all exhibition halls - Big Data. This subject is not completely new and I think a lot of you already have experience with some of the tools associated with Big Data. But given the great number of databases, frameworks and engines in this field, there will always be something new to learn. So two weeks ago I started my own experiments with a graph database called Neo4j, one of the NoSQL databases. What attracted me was that I read it is well suited for highly connected data and that it offers a declarative language for querying the graph. Roughly speaking, a graph database consists of nodes and of edges connecting these nodes. Both can be enriched with properties. Some introductions which helped me can be found here and here. The graph query language “Cypher” can then be used to query the data by traversing the graph. Cypher itself is a declarative pattern-matching language and should be easy to pick up for anyone familiar with SQL. A well-arranged overview can be found at this address.
If you look at my older posts, you will see that most of them deal with spatial data or data that has at least some spatial dimension. This kind of data often has inherent relationships - for example streets connected in a street network, regions connected through a common border, places visited by people and so on. So for my first experiment with Neo4j I decided to combine one of the most discussed Big Data use cases - recommendation/recommender systems - with an interesting dataset from the location-based social network Foursquare that I collected last year.
The main idea behind this simple “Spatial Recommendation Engine” is to use publicly available check-in data to recommend new places to users that they have never visited before. Such a “check-in” consists of a user ID, a place (called a venue) and a check-in time, plus additional information (venue type, ..). The following code shows the structure of the already preprocessed data:
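Since the exact layout of the data frame matters for the later steps, here is a minimal sketch of what such a preprocessed check-in table could look like (the column names and example values are assumptions for illustration, not the original data):

```r
# Sketch of the preprocessed check-in data: one row per check-in with a user ID,
# a venue ID, venue name, venue type, coordinates and the check-in time
# (column names and values are assumptions).
checkins <- data.frame(
  userid      = c("u1", "u1", "u2"),
  venueid     = c("v10", "v11", "v10"),
  venuename   = c("Cafe Central", "Hauptbahnhof", "Cafe Central"),
  venuetype   = c("Food", "Travel & Transport", "Food"),
  lat         = c(52.52, 52.53, 52.52),
  lng         = c(13.40, 13.41, 13.40),
  checkintime = as.POSIXct(c("2012-06-01 10:15:00",
                             "2012-06-01 18:30:00",
                             "2012-06-02 09:05:00")),
  stringsAsFactors = FALSE
)
str(checkins)
```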
The data was crawled last year as the basis for an academic paper in the field of Urban Computing (which will be presented in May at the AGILE Conference on Geographic Information Science in Brussels) and contains publicly available check-ins for Germany. It seems to me that this kind of data is ideally suited for doing recommendations in a graph database and avoids the use of the well-known toy datasets. The key idea behind our recommendation system is the following: starting with the person for whom we want to make a recommendation, we calculate the most similar users. A similar user is someone who rated venues in the same way the person of interest did. Because there is no explicit rating in the Foursquare data, we take the number of visits as the rating - the logic being that the person either likes a place or it is important to them. So if the person and another user both give high “ratings” to venues they both visited (and are therefore similar), the person may also be interested in other venues highly rated by that user which the person has not seen yet. Technically speaking, this approach is collaborative filtering (user similarity is calculated from behavior) with implicit data collection (we have no explicit ratings). Our data model is therefore straightforward: we take the venues and the users as nodes and turn all their attributes into corresponding node properties. Then we connect a user node and a venue node with a relationship whenever the user has visited this venue; the number of visits is stored as a property of the relationship.
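Expressed in Cypher (which we will later send from R as plain strings), one such user-venue pair could be created like this; the property names, the RATED relationship type and the values are assumptions for illustration:

```r
# Sketch of the data model: one user node, one venue node and a RATED
# relationship that stores the number of visits (names and values are assumptions).
create_example <- "
CREATE (u {type: 'user', userid: 'u1'}),
       (v {type: 'venue', venueid: 'v10', name: 'Cafe Central',
           venuetype: 'Food', lat: 52.52, lng: 13.40}),
       (u)-[:RATED {visits: 3}]->(v)"
```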
For the recommender system we will use a combination of R and Cypher statements, the latter primarily for loading the data into Neo4j and traversing the graph. For sending Cypher statements to Neo4j the REST API is of great value. R is then used to preprocess the data, capture the results and calculate the final recommendation list.
The following is a short overview of all the steps:
Extracting all relevant information (venues, users, ratings) from the check-in data
Loading the data into Neo4j
Calculating similarities for a specific user and making a recommendation on-the-fly
Plotting the results on a map
I assume that Neo4j is installed (it’s very simple - look here) and that the graph database is empty. To empty it, delete the “graph.db” directory and then start Neo4j.
So our first step is to extract all venues, users and ratings from the check-in data.
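A minimal sketch of this step, assuming the check-in data frame sketched above:

```r
# Unique venues with their properties
venues <- unique(checkins[, c("venueid", "venuename", "venuetype", "lat", "lng")])

# Unique users
users <- unique(checkins[, "userid", drop = FALSE])

# Implicit ratings: number of check-ins per user and venue
ratings <- aggregate(list(visits = rep(1, nrow(checkins))),
                     by = list(userid = checkins$userid,
                               venueid = checkins$venueid),
                     FUN = sum)
```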
The next step is to import all that data into Neo4j. We do this by generating dynamic Cypher statements that create all the nodes and relationships. This will of course take some time; if you have more data, it may be wiser to use the “Batch Importer”, but that requires more development effort and will not be explained here. Neo4j’s website offers a lot of possibilities for importing data from various sources into the graph database. All of our Cypher statements are sent to Neo4j via the “query” method, which I got from here.
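A sketch of such a query helper and of the dynamically generated CREATE statements, assuming the RCurl and RJSONIO packages and a local Neo4j instance with the default REST endpoint (the property names follow the assumptions made above):

```r
library(RCurl)
library(RJSONIO)

# Send a Cypher statement to Neo4j's REST endpoint and return the result
# as a data frame (based on the approach referenced above).
query <- function(querystring) {
  h <- basicTextGatherer()
  curlPerform(url = "http://localhost:7474/db/data/cypher",
              postfields = paste("query", curlEscape(querystring), sep = "="),
              writefunction = h$update,
              verbose = FALSE)
  result <- fromJSON(h$value())
  if (length(result$data) == 0) return(invisible(NULL))  # e.g. for CREATE statements
  data <- data.frame(t(sapply(result$data, unlist)), stringsAsFactors = FALSE)
  names(data) <- result$columns
  data
}

# Example of dynamically generated CREATE statements for the venue nodes;
# user nodes and the RATED relationships are created analogously
# (quotes inside venue names would need escaping in real data).
for (i in seq_len(nrow(venues))) {
  q <- sprintf("CREATE (v {type: 'venue', venueid: '%s', name: '%s',
                           venuetype: '%s', lat: %f, lng: %f})",
               venues$venueid[i], venues$venuename[i], venues$venuetype[i],
               venues$lat[i], venues$lng[i])
  query(q)
}
```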
Before we start with the recommender itself, I will discuss some of its details. The first part of the plan is to compute the similarities between a person and all other users who also visited at least one of the venues the person did. Based on these similarities we then determine the recommendations. This means that we need a similarity measure first. In our case we use the cosine similarity, a measure typically used in text mining for high-dimensional data (which also fits our case). A maximum value of 1 means that both users rated all venues they visited in the same way (“the profiles of both are similar”). If you calculated the similarity in the traditional way, you would first have to build a feature table of size \(m \times n\) (\(m\) ~ number of users, \(n\) ~ number of venues) where the value at \((i,j)\) represents the rating of user \(i\) for venue \(j\). This feature table would be huge and sparse, because most users have only visited a few venues. A graph is an efficient way to represent this, because only the ratings that actually exist have to be encoded as explicit relationships.
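As a small illustration (not the exact code used later), the cosine similarity of two users can be computed directly from their visit counts, stored as named vectors:

```r
# Cosine similarity between two users' rating vectors, given as named
# vectors of visit counts (the names are venue IDs).
cosine_similarity <- function(r1, r2) {
  common <- intersect(names(r1), names(r2))
  if (length(common) == 0) return(0)
  sum(r1[common] * r2[common]) / (sqrt(sum(r1^2)) * sqrt(sum(r2^2)))
}

# Example: two users sharing two venues
cosine_similarity(c(v10 = 3, v11 = 1), c(v10 = 2, v11 = 1, v12 = 5))
```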
After we have chosen a person for whom we want to compute recommendations, we start by calculating all relevant similarities. To get more meaningful recommendations we exclude all venues of the type “Travel & Transport” and only take those users into account who have at least two visited venues in common with the chosen person. For this last part we have to use R because, if I’m right, Neo4j is currently unable to carry out “subselects”.
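The first query and the additional filtering in R could look roughly like this; the index name “users”, the property names and the RATED relationship type are assumptions, and the exact Cypher syntax depends on the Neo4j version:

```r
person_id <- "u1"  # hypothetical ID of the chosen person

# Ratings of all venues the person shares with other users,
# excluding venues of type 'Travel & Transport'
sim_query <- sprintf("
  START person = node:users(userid = '%s')
  MATCH (person)-[r1:RATED]->(venue)<-[r2:RATED]-(other)
  WHERE venue.venuetype <> 'Travel & Transport' AND ID(other) <> ID(person)
  RETURN other.userid AS userid, venue.venueid AS venueid,
         r1.visits AS person_visits, r2.visits AS other_visits", person_id)
shared <- query(sim_query)

# The 'subselect' part in R: keep only users with at least two venues in common
counts <- table(shared$userid)
shared <- shared[shared$userid %in% names(counts[counts >= 2]), ]

# Cosine similarity of each remaining user to the chosen person,
# computed over the shared venues
similarities <- sapply(split(shared, shared$userid), function(d) {
  p <- as.numeric(d$person_visits)
  o <- as.numeric(d$other_visits)
  sum(p * o) / (sqrt(sum(p^2)) * sqrt(sum(o^2)))
})
```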
The second query then selects all venues (call them recommendation candidates) rated by the similar users but not yet visited by the person. It returns the user ratings and the venue properties such as name, type and geographic coordinates.
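A sketch of this second query, reusing the similar users found above (same assumptions about names and syntax as before):

```r
# Venues rated by the similar users that the chosen person has not visited yet
similar_ids <- paste0("'", names(similarities), "'", collapse = ", ")
cand_query <- sprintf("
  START person = node:users(userid = '%s')
  MATCH (other)-[r:RATED]->(venue)
  WHERE other.userid IN [%s]
    AND NOT((person)-[:RATED]->(venue))
    AND venue.venuetype <> 'Travel & Transport'
  RETURN other.userid AS userid, venue.venueid AS venueid, venue.name AS name,
         venue.venuetype AS venuetype, venue.lat AS lat, venue.lng AS lng,
         r.visits AS visits", person_id, similar_ids)
candidates <- query(cand_query)
```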
The last step is to determine the top X recommendations. For every recommendation candidate we compute a rating weighted by the similarity between the chosen person and each of the similar users who have already visited it, and then pick the top X venues as our recommendations.
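In R this boils down to a weighted aggregation over the candidates and similarities from the sketches above (the top 10 is an arbitrary choice):

```r
# Weight every candidate rating by the similarity of the user who gave it,
# sum the weighted ratings per venue and keep the top 10
candidates$visits   <- as.numeric(candidates$visits)
candidates$weighted <- candidates$visits * similarities[candidates$userid]

scores <- aggregate(weighted ~ venueid + name + venuetype + lat + lng,
                    data = candidates, FUN = sum)
recommendations <- head(scores[order(-scores$weighted), ], 10)
```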
We close the coding section with a nice visualization of the already visited (red) and the recommended (blue) venues on a map. This gives a first impression of how the venues and the recommendations are distributed in geographic space.
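A minimal sketch of such a plot, simply drawing the coordinates (a proper basemap, for example via the ggmap package, could be layered underneath):

```r
# Venues the chosen person has already visited
visited <- query(sprintf("
  START person = node:users(userid = '%s')
  MATCH (person)-[:RATED]->(venue)
  RETURN venue.lat AS lat, venue.lng AS lng", person_id))

visited$lat <- as.numeric(visited$lat)
visited$lng <- as.numeric(visited$lng)
recommendations$lat <- as.numeric(recommendations$lat)
recommendations$lng <- as.numeric(recommendations$lng)

# Visited venues in red, recommended venues in blue
plot(visited$lng, visited$lat, col = "red", pch = 19,
     xlab = "Longitude", ylab = "Latitude",
     main = "Visited (red) and recommended (blue) venues")
points(recommendations$lng, recommendations$lat, col = "blue", pch = 19)
```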
Finally it is time to summarize what was done: we built a simple recommendation engine that can be used to recommend new places that users should visit. The recommendation is based on their past behavior and can be computed in near real-time (there is no long-running batch job). What we left out is a rigorous evaluation of the recommendation performance, but because the recommendation only serves as a demo use case, this was not the topic of this post.
More importantly, I have to say that I’m really impressed by how easy it is to set up a Neo4j graph database and how simple it is to take the first steps with the query language Cypher. The SQL-like style of Cypher makes determining the most similar users straightforward. What’s also interesting is the simplicity of the connection from R to Neo4j via the REST interface - no additional tools are needed.
The outlook is even more promising given the main advantages of NoSQL databases, such as fast operations on huge datasets and easy, fault-tolerant scalability (not shown here). Even though the operations ran fast on this moderately sized dataset, a broader test was beyond the scope of this session. But maybe in one of the next posts …