2014-04-24

As you probably already know, last week was Away Day time for the entire Metabroadcast team. We had a really good time talking about how we might improve our processes and products, and learning some basic Wing Chun techniques. As Oli already explained in his latest post, in the morning we were split into small teams, each focusing on a particular aspect of our working life and trying to find possible improvements.

I was proudly part of the Equivicats team, alongside awesome teammates Tiff, Fred and James. Our aim was to come up with suggestions for improving Atlas' Equivalence sub-system. In this post I'll take you through what the team discussed and how we think we can make Equivalence better.

equivalence, a short recap

Equivalence is the sub-system in Atlas which determines that two pieces of data from different sources represent the same content, i.e. are in some sense equivalent, and links them together.

(Fred van den Driessche)

I'm not going to explain how Equivalence works in detail, since it's quite complex and Fred has already written about it a couple of times. I recommend reading his posts to better understand what follows...

...did you read them? Awesome! As you can see, what Equivalence tries to achieve is a very complex task. Ingesting metadata from a heterogeneous set of sources (each with its own data structures and conventions) and stating, in a completely automated way, that one piece of content (e.g. an episode of a TV series) is equivalent to another is very, very hard. Even though we're quite happy with how Equivalence currently works, there's always room for improvement, and we want to keep making it better, so let's have a look at the suggestions the team came up with...

the faster, the better

Atlas handles millions of pieces of content. As you can imagine, building the equivalence graph for a given piece of content is a very resource-heavy and time-consuming process. The first step of the process, the candidate search, is where most of the computation occurs: it has to find the potential candidates for equivalence among a huge number of available pieces of content. We can improve the overall performance by distributing the most resource-intensive parts of the computation across a cluster of workers, using map-reduce techniques or a real-time computation system. Hadoop and Storm are already part of our infrastructure, and the knowledge we've acquired working with these systems will be extremely useful here.
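To make the idea a bit more concrete, here's a minimal sketch of fanning per-subject candidate searches out across a pool of workers. It's only an illustration using plain Java concurrency on a single machine, with a hypothetical `findCandidates` standing in for the real candidate generator; the actual computation would be distributed across a cluster with Hadoop or Storm rather than a thread pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCandidateSearch {

    // Stand-in for a piece of content; the real Atlas model is much richer.
    record Content(String uri, String title) { }

    // Hypothetical candidate generator: in reality this queries an index of
    // content from other sources, which is the expensive part of the process.
    static List<Content> findCandidates(Content subject, List<Content> corpus) {
        List<Content> candidates = new ArrayList<>();
        for (Content other : corpus) {
            if (!other.uri().equals(subject.uri())
                    && other.title().equalsIgnoreCase(subject.title())) {
                candidates.add(other);
            }
        }
        return candidates;
    }

    public static void main(String[] args) throws Exception {
        List<Content> corpus = List.of(
                new Content("bbc:1", "Doctor Who"),
                new Content("pa:9", "Doctor Who"),
                new Content("pa:10", "Sherlock"));
        List<Content> subjects = List.of(
                new Content("bbc:1", "Doctor Who"),
                new Content("bbc:2", "Sherlock"));

        // Fan the per-subject searches out across a pool of workers, much as a
        // map-reduce job or Storm topology would spread them across a cluster.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<List<Content>>> results = new ArrayList<>();
        for (Content subject : subjects) {
            results.add(pool.submit(() -> findCandidates(subject, corpus)));
        }
        for (int i = 0; i < subjects.size(); i++) {
            System.out.println(subjects.get(i).uri() + " -> " + results.get(i).get());
        }
        pool.shutdown();
    }
}
```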

being more precise

Once the equivalence candidates have been selected, each is assigned an equivalence score based on a number of heuristics (e.g. title similarity). A score is a number between -1 and 1, and only candidates scoring above a certain threshold are considered for the rest of the process. Obviously it would be great to have a system that could say with absolute certainty whether or not a piece of content is equivalent to another, but with so many variables in play that's simply not possible. A piece of content that has been declared equivalent to another may, in fact, not be. The aim is to improve the precision of both the candidate selection and the scoring steps.
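To give a feel for the shape of this step, here's a minimal, hypothetical sketch of how scorers and a threshold might fit together. The interface, names and the way scores are combined (a simple average) are all illustrative, not the actual Atlas code.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ScoringSketch {

    record Candidate(String uri, String title) { }

    // A scorer maps a (subject, candidate) pair to a score in [-1, 1].
    interface EquivalenceScorer {
        double score(Candidate subject, Candidate candidate);
    }

    // Illustrative title-similarity scorer: 1 for an exact (case-insensitive)
    // match, -1 otherwise. Real scorers are far more nuanced.
    static final EquivalenceScorer TITLE_SCORER =
            (subject, candidate) ->
                    subject.title().equalsIgnoreCase(candidate.title()) ? 1.0 : -1.0;

    // Keep only candidates whose combined score clears the threshold.
    static List<Candidate> filterByScore(Candidate subject,
                                         List<Candidate> candidates,
                                         List<EquivalenceScorer> scorers,
                                         double threshold) {
        return candidates.stream()
                .filter(c -> scorers.stream()
                        .mapToDouble(s -> s.score(subject, c))
                        .average()
                        .orElse(-1.0) > threshold)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Candidate subject = new Candidate("bbc:1", "Doctor Who");
        List<Candidate> candidates = List.of(
                new Candidate("pa:9", "Doctor Who"),
                new Candidate("pa:10", "Sherlock"));
        System.out.println(
                filterByScore(subject, candidates, List.of(TITLE_SCORER), 0.5));
    }
}
```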

This is a good opportunity to introduce some machine learning techniques. It turns out that our own James shines in this field and has already started applying some of his knowledge to our systems, so he will certainly make an important contribution here. One of the ideas is to create a training data set and implement algorithms that can learn from it, improving the accuracy of the candidate generation step.
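As a toy example of what learning from such a data set could look like (a sketch of the general idea, not what James will actually build), one could label historical candidate pairs as equivalent or not and use them to tune the decision threshold:

```java
import java.util.List;

public class ThresholdTuning {

    // A labelled training example: the score a candidate pair received and
    // whether a human judged the pair to be genuinely equivalent.
    record Example(double score, boolean equivalent) { }

    // Sweep possible thresholds and keep the one with the highest accuracy
    // on the training set. A real system would use richer features, a proper
    // learning algorithm and held-out evaluation data.
    static double bestThreshold(List<Example> training) {
        double best = -1.0;
        int bestCorrect = -1;
        for (double t = -1.0; t <= 1.0; t += 0.05) {
            int correct = 0;
            for (Example e : training) {
                boolean predicted = e.score() > t;
                if (predicted == e.equivalent()) {
                    correct++;
                }
            }
            if (correct > bestCorrect) {
                bestCorrect = correct;
                best = t;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Example> training = List.of(
                new Example(0.9, true),
                new Example(0.7, true),
                new Example(0.4, false),
                new Example(0.2, false),
                new Example(-0.3, false));
        System.out.println("learned threshold: " + bestThreshold(training));
    }
}
```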

More heuristics can also be considered for the score computation, using a wider subset of the data we have available. For example, a new heuristic could look at the list of people associated with a piece of content (e.g. the actors performing in a movie): a candidate would then be scored based on how well its list of people matches the one associated with the target piece of content.
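A hypothetical version of such a scorer might use something like Jaccard similarity between the two cast lists, rescaled into the usual [-1, 1] range. Again, the names and the exact formula are illustrative rather than a committed design.

```java
import java.util.HashSet;
import java.util.Set;

public class PeopleScorer {

    // Score in [-1, 1] based on how much the two people lists overlap:
    // Jaccard similarity (|intersection| / |union|) rescaled from [0, 1].
    static double score(Set<String> subjectPeople, Set<String> candidatePeople) {
        if (subjectPeople.isEmpty() && candidatePeople.isEmpty()) {
            return 0.0; // no evidence either way
        }
        Set<String> intersection = new HashSet<>(subjectPeople);
        intersection.retainAll(candidatePeople);
        Set<String> union = new HashSet<>(subjectPeople);
        union.addAll(candidatePeople);
        double jaccard = (double) intersection.size() / union.size();
        return 2 * jaccard - 1;
    }

    public static void main(String[] args) {
        Set<String> film = Set.of("Harrison Ford", "Rutger Hauer", "Sean Young");
        Set<String> candidate = Set.of("Harrison Ford", "Rutger Hauer", "Daryl Hannah");
        System.out.println(score(film, candidate)); // 2 of 4 people shared -> 0.0
    }
}
```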

can humans help?

We've been talking about a completely automated process so far. Heuristics are used to tell, with some degree of certainty, whether a piece of content is equivalent to another. If there is a lot of uncertainty in our heuristics, we might end up saying that two pieces of content are equivalent when they actually aren't. Can human eyes help make the final decision in some cases? We think they might: it's much easier for a human to tell whether two movies are the same, given a quick look at the available metadata for both. We're therefore planning to allow a human supervisor to explicitly define or break equivalence between pieces of content as needed.
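One hypothetical way to represent those explicit decisions is to record them separately from the automated results so that they always take precedence. The sketch below is purely illustrative; the class and method names are assumptions, not part of Atlas.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class ExplicitEquivalence {

    enum Decision { EQUIVALENT, NOT_EQUIVALENT }

    // Explicit human decisions, keyed by an order-independent pair of URIs.
    private final Map<String, Decision> decisions = new HashMap<>();

    private static String key(String a, String b) {
        return a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
    }

    // The supervisor explicitly asserts that two pieces of content are the same...
    public void assertEquivalent(String uriA, String uriB) {
        decisions.put(key(uriA, uriB), Decision.EQUIVALENT);
    }

    // ...or explicitly breaks a link the automated process got wrong.
    public void breakEquivalence(String uriA, String uriB) {
        decisions.put(key(uriA, uriB), Decision.NOT_EQUIVALENT);
    }

    // A human decision, if present, overrides whatever the heuristics said.
    public boolean isEquivalent(String uriA, String uriB, boolean automatedResult) {
        return Optional.ofNullable(decisions.get(key(uriA, uriB)))
                .map(d -> d == Decision.EQUIVALENT)
                .orElse(automatedResult);
    }

    public static void main(String[] args) {
        ExplicitEquivalence supervisor = new ExplicitEquivalence();
        supervisor.breakEquivalence("bbc:1", "pa:10");
        System.out.println(supervisor.isEquivalent("bbc:1", "pa:10", true)); // false
        System.out.println(supervisor.isEquivalent("bbc:1", "pa:9", true));  // true
    }
}
```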

monitoring the equivalence graph

Given our plans for human oversight, some tools will be needed to monitor the current equivalence graph. These will make it easier for the supervisor to understand what's going on and to determine whether the whole equivalence graph actually makes sense. We've already started working in this direction: Oana is currently building a nice, powerful, web-based equivalence monitoring tool that will make our lives easier and should enable us to spot any issues as soon as possible.
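As an illustration of the kind of sanity check such a tool could run (this is an assumption about one possible check, not a description of Oana's tool), it could flag equivalence sets that have grown suspiciously large, which is often a sign that unrelated content has been merged:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class EquivalenceSetCheck {

    // Flag equivalence sets (groups of URIs believed to be the same content)
    // that contain more members than we'd ever expect from our sources.
    static Map<String, List<String>> suspiciouslyLarge(
            Map<String, List<String>> equivalenceSets, int maxExpectedSize) {
        return equivalenceSets.entrySet().stream()
                .filter(e -> e.getValue().size() > maxExpectedSize)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<String, List<String>> sets = Map.of(
                "set-1", List.of("bbc:1", "pa:9"),
                "set-2", List.of("bbc:2", "pa:10", "pa:11", "itv:3", "c4:7", "five:2"));
        // With only a handful of sources, a set of six members looks suspicious.
        System.out.println(suspiciouslyLarge(sets, 4).keySet());
    }
}
```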

any other suggestions?

Have you been involved in implementing as complex a task as equivalence? We'd love to hear your suggestions on how we can continue to improve it. Feel free to leave us a comment below or to Tweet us!
