Sujitpal.blogspot.com

Querying UMLS Concepts in Neo4j with Cypher

2014-03-07

UMLS concepts are connected to each other by relationships. Conceptually, this structure is a large directed graph with 2.8M nodes and 51.7M relationships. Storing it in a graph database such as Neo4j makes a lot of sense, because we can then query the data with Neo4j's native query language, Cypher. The result is an almost invisible user interface whose extensibility is limited only by Cypher's own capabilities. In this post, I describe how I used Cypher (via Scala, using Neo4j's Java REST API) to build some navigational services over my UMLS-based taxonomy.

As I mentioned last week, I loaded up the UMLS concept and relationship data using the batch-import tool. I realized after I did so that I would like an extra semantic codes field in my concept data, and that I would like to get rid of loops in my relationship data. So I downloaded the data from the UMLS database again using the following SQL calls.

I then ran a slightly modified version of the syns_aggregator_job.py to roll up synonyms as well as semantic types by CUI (resulting in a record represented by the case class Concept in the code below). Since the rels_filter_job.py was unnecessary, I didn't run it this time. I then added the headers as described in my previous post and reran the batch-import tool.
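To make the roll-up step concrete, here is a minimal sketch of grouping rows by CUI in Scala. It is not the actual Python job: the Row and RolledConcept classes, their field names, and the one-row-per-(CUI, synonym, semantic type) input shape are all assumptions made for illustration.

```scala
// Hypothetical input row: one (CUI, synonym, semantic type code) triple.
// The field names are assumptions, not the real export schema.
case class Row(cui: String, syn: String, sty: String)

// Hypothetical rolled-up record: one per CUI, holding the distinct
// synonyms and semantic type codes seen for that CUI.
case class RolledConcept(cui: String, syns: List[String], stys: List[String])

object RollUp {
  // Group rows by CUI, then collapse each group into a single record.
  // Results are sorted by CUI so the output order is deterministic.
  def rollUp(rows: Seq[Row]): Seq[RolledConcept] =
    rows.groupBy(_.cui).toSeq.sortBy(_._1).map { case (cui, rs) =>
      RolledConcept(cui, rs.map(_.syn).distinct.toList, rs.map(_.sty).distinct.toList)
    }
}
```

Each CUI ends up as a single record, which is the shape a one-node-per-concept import file needs.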

My objective was to implement four navigational services that are currently used with our memory-based taxonomy front end. For each one, I first tried out the Cypher query in the Neo4j shell to make sure it worked, and then implemented the service as a method in a Scala class as shown below. This blog post helped me to figure out how to use the Neo4j Java REST API.

The four services get a concept by its CUI, list the unique outgoing relationship types from a CUI, list the CUIs of the concepts related to a given CUI by a given relationship, and find the path between two concepts specified by their CUIs. All of them look up nodes by CUI through the Lucene index "concepts" that was created during the batch import, rather than by the internal Neo4j node ID.
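As a rough illustration, the four lookups might translate into Cypher along the following lines. This is a sketch under assumptions rather than the queries used in the services: it uses the legacy-index START syntax of Neo4j 2.0, assumes the index key and node property are both named cui, and the CypherQueries object, its method names, and the default hop limit of 5 are hypothetical. The queries are only built as strings here; sending them to the server is left out.

```scala
object CypherQueries {
  // A Cypher query string together with its named parameters.
  case class Query(text: String, params: Map[String, Any])

  // 1. Look up a concept node by CUI via the "concepts" index.
  def conceptByCui(cui: String): Query =
    Query("START n=node:concepts(cui={cui}) RETURN n", Map("cui" -> cui))

  // 2. List the distinct outgoing relationship types for a CUI.
  def relTypes(cui: String): Query =
    Query("START a=node:concepts(cui={cui}) MATCH (a)-[r]->() " +
      "RETURN DISTINCT type(r) AS rel", Map("cui" -> cui))

  // 3. List CUIs of concepts reached through one relationship type.
  // Cypher cannot take a relationship type as a parameter, so the name is
  // validated and then spliced into the query text.
  def related(cui: String, rel: String): Query = {
    require(rel.matches("[A-Za-z0-9_]+"), s"unexpected relationship name: $rel")
    Query(s"START a=node:concepts(cui={cui}) MATCH (a)-[:`$rel`]->(b) " +
      "RETURN b.cui AS cui", Map("cui" -> cui))
  }

  // 4. Find a shortest directed path between two CUIs, bounded by maxHops
  // (an assumed limit, to keep the search cheap on a 51.7M-edge graph).
  def shortestPath(cui1: String, cui2: String, maxHops: Int = 5): Query = {
    require(maxHops > 0, "maxHops must be positive")
    Query("START a=node:concepts(cui={cui1}), b=node:concepts(cui={cui2}) " +
      s"MATCH p=shortestPath((a)-[*..$maxHops]->(b)) RETURN p",
      Map("cui1" -> cui1, "cui2" -> cui2))
  }
}
```

Passing the CUIs as parameters instead of concatenating them into the query text avoids quoting problems and lets the server reuse the parsed query across calls.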

I had to add the following dependencies to my build.sbt to get this to compile and run.

To execute the services, I used the JUnit test shown below. Results are inlined with the code to make it easy to see.

Cypher looks a bit different from most high-level query languages (at least it looked different to me), but there are good resources for learning Cypher such as this one. Cypher makes accessing the data in the graph really simple. It also opens the door to some graph-based analytics in the future. In terms of scaling, since the taxonomy is read-only (the data is maintained via a CRUD webapp in an RDBMS and exported periodically), the REST API enables us to balance load across multiple replica graph servers.