By Andy Rossback, Ivar Vong
There is untapped value in a reporter’s endorsement of an article about something on their beat. They cite it. They post it to Twitter. They send it to colleagues. What if you could capture every important article read by reporters across your company, then categorize those articles by topic, place, and person?
It would become much easier for a college student to write a term paper about juvenile justice. Or for a reporter on deadline to file a context-heavy piece about a jail suicide epidemic in their community. Or for an academic to find secondary source material about prosecutorial misconduct. Or for a reader to fully understand the broader issues behind an unfolding news story out of San Bernardino, California.
So we’re introducing The Record, the Marshall Project’s compendium of reporter-curated criminal justice links. As a small newsroom, we knew from the beginning that aggregation would be an important part of our national report. The majority of our daily newsletter is aggregation. It seemed useful to try to collect all the work reporters do every day, rather than let those tabs close silently.
We hope organizing these links helps people cut through the volume of the daily news cycle and see these topics beyond the individual story.
Collecting Structured Data
The Record is powered by a simple tool we built in 2014—called Gator, short for aggregator—to collect stories being read by reporters during the course of their workday. It’s a bookmarklet, similar to Delicious. The reporter inputs each story and adds a few tags.
Instead of free-form tagging, we used a centralized tag database, which reporters can add to on the fly. Those same tags can be applied to any other content on the site, including our own posts.
To date, we’ve collected more than 14,000 links, organizing them among 2,500 tags.
We’ve already been using this tool to power our morning newsletter—anyone in the office can add a link, and in the morning, a drag-and-drop tool flows them into the template. This keeps formatting consistent in the newsletter template.
Now, we’re opening up this repository to the public as The Record. Readers, along with students, academics, reporters, and others, can access these topic pages, which carry Marshall Project original content as well as the best of the web collected by our reporters.
How The Record Works
Homepage
The homepage is the entry point to The Record. There are two pieces, which we hope turn a bit of curiosity into a deeper look at a topic.
The first is a snapshot of recent activity. From the dozen or so links created every day, we show the tags that have had the most links added in the past week and the tags most recently changed. Our staff (especially Andrew Cohen!) does a pretty incredible job of reading hundreds of news sites, so this represents our best summary of what’s going on in criminal justice news.
The second way into The Record is search. The first search strategy is partial matching against tag names. The second uses the full text of the stories, stored in Elasticsearch, to suggest tags related to a general query. (Elasticsearch is a document database that excels at full text search on semi-structured data.)
For example, when the query “kids” is sent to Elasticsearch, it returns a set of the model IDs and associated scores. We then pull the links from the primary database, Postgres, along with the reporter-created tags. The query returns a couple dozen links, each with several tags. We roll those up to find the most popular tags.
This is admittedly a relatively crude method, and we’re excited to improve it as we get some live searches. It should go without saying: Taxonomies and search engines are very complicated.
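The tag roll-up described above can be sketched in a few lines of plain Ruby. This is a minimal illustration, not the production code: the `HITS` data, the `TAGS_BY_LINK` lookup (standing in for the Postgres query), and the `suggest_tags` helper are all hypothetical names, and ties are broken alphabetically only to keep the output stable.

```ruby
# Hypothetical search hits, shaped like the IDs and scores a search engine returns.
HITS = [
  { id: 1, score: 2.4 },
  { id: 2, score: 1.9 },
  { id: 3, score: 0.7 }
].freeze

# Stand-in for the primary database lookup: link ID => reporter-created tags.
TAGS_BY_LINK = {
  1 => ["Juvenile Justice", "Solitary Confinement"],
  2 => ["Juvenile Justice", "Sentencing"],
  3 => ["Sentencing", "Juvenile Justice"]
}.freeze

# Count how often each tag appears across the matching links,
# then return the most frequent tags first (ties broken by name).
def suggest_tags(hits, tags_by_link, limit: 5)
  counts = hits.flat_map { |hit| tags_by_link.fetch(hit[:id], []) }.tally
  counts.sort_by { |tag, count| [-count, tag] }.first(limit).map(&:first)
end

puts suggest_tags(HITS, TAGS_BY_LINK).inspect
```

This sketch ignores the search scores entirely and counts every hit equally. Weighting each tag by the score of the links that carry it would be one possible refinement.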
Topic Pages
Topic pages list Marshall Project original content, links from around the web, and related topics. We may also extend the type of content on each page to include maps or writethroughs related to the subject or storyline.
Creating structured data upfront makes it easier to query the aggregated links on these pages. The “Related Tags” module uses this—the number of links common to two tags acts as the weight in the graph—to retrieve similar tags. The “tagging” model is a has_many :through relation that creates an edge between a tag and a taggable—for us, taggables are either links or posts. As we experiment with the tag graph and our link metadata, we’re likely to move some of these computations to background jobs.
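To make the edge weighting concrete, here is a rough sketch of the shared-link count in plain Ruby. It works on in-memory sets rather than the tagging model, and `LINKS_BY_TAG` and `related_tags` are hypothetical names chosen for illustration.

```ruby
require "set"

# Hypothetical taggings: tag name => IDs of the links carrying that tag.
LINKS_BY_TAG = {
  "Death Penalty"    => Set[1, 2, 3, 4],
  "Lethal Injection" => Set[2, 3, 4],
  "Supreme Court"    => Set[4, 5],
  "Bail"             => Set[6]
}.freeze

# The weight of the edge between two tags is the number of links they share.
# Returns [tag, weight] pairs, heaviest first, skipping tags with no overlap.
def related_tags(tag, links_by_tag, limit: 3)
  own_links = links_by_tag.fetch(tag)
  links_by_tag
    .reject { |other, _| other == tag }
    .map { |other, links| [other, (own_links & links).size] }
    .select { |_, weight| weight.positive? }
    .sort_by { |other, weight| [-weight, other] }
    .first(limit)
end

puts related_tags("Death Penalty", LINKS_BY_TAG).inspect
```

Comparing one tag against every other tag on each request grows with the size of the tag set, which is one reason a computation like this is a candidate for a background job.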
Links within a tag page can be sorted by “Popular” or “Recent.” Popular is measured by total Facebook share count. Recent is measured by when it was added. We investigated using “published time” instead, but looking at the structured data provided by these 600+ domains, we haven’t found a good solution for sorting by published date.
We use Sidekiq as our background job queue throughout our Ruby on Rails–based CMS. To update link popularity, we periodically pull the share count in a job. This is done more frequently in the first hours, then backed off over time. This data is stored directly on the link model, though we’d like to expand this. Specifically, we’re interested in measuring the change in share count over time and using the velocity of individual links to understand broader topic shifts.
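One way to express "more often at first, then backed off" is a small lookup from link age to polling interval. The sketch below is an assumption-heavy illustration: the thresholds and intervals in `POLL_SCHEDULE` are invented for the example, not the values used in production, and `poll_interval_minutes` is a hypothetical helper.

```ruby
# Hypothetical schedule: [maximum link age in hours, minutes between checks].
# These numbers are illustrative assumptions only.
POLL_SCHEDULE = [
  [6, 15],       # first six hours: every 15 minutes
  [24, 60],      # rest of the first day: hourly
  [24 * 7, 360]  # rest of the first week: every six hours
].freeze

# Older links fall back to a daily check.
FALLBACK_INTERVAL_MINUTES = 24 * 60

# Return how many minutes to wait before refreshing a link's share count.
def poll_interval_minutes(age_hours)
  _, minutes = POLL_SCHEDULE.find { |max_age, _| age_hours < max_age }
  minutes || FALLBACK_INTERVAL_MINUTES
end

[1, 12, 48, 500].each do |age|
  puts "#{age}h old: check again in #{poll_interval_minutes(age)} minutes"
end
```

In a Sidekiq job, a value like this (converted to seconds) could be handed to `perform_in` so that each run schedules the next check.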
Some of the topics have many links within them, too many to render these pages on the server performantly. We need a stable sort order to be able to paginate. The first complexity comes from a desire to include multiple models in a single feed, and the second comes from which property we’re sorting on. Initially, we’re sorting on data stored on the models. But in the future, as we combine data sources, we won’t be able to generate these on-the-fly quickly enough.
Our solution is to create lightweight “slices” that allow fast offset and limit querying. A slice is made up of a tag, a sort order, and a set of models. A background worker generates an array where each item is the model and its score. This looks something like [["link:355", 2649], ["link:184", 7593]]. These are marshaled to JSON and persisted in memcached. When this worker runs again, we have the added benefit of atomic updates for the slice.
When a slice is needed, it’s unmarshaled and the offset and limit are applied to find the subset needed. These are queried from ActiveRecord and passed to a presenter. We use Mustache (just for these) so the presenter’s output can be passed directly to the template on the server, or marshaled to JSON for the client.
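The read side of a slice can be sketched without Rails or memcached. In this minimal example the cache is replaced by a JSON string held in a constant, and `SLICE_JSON`, `page_keys`, and `ids_by_model` are hypothetical names. The "post" key is an assumed counterpart to the "link" keys, since a feed can mix both models.

```ruby
require "json"

# A hypothetical precomputed slice: [model key, score] pairs, already sorted
# by the background worker. In production this string would come from the cache.
SLICE_JSON = JSON.generate([
  ["link:184", 7593],
  ["post:12", 4100],
  ["link:355", 2649],
  ["link:90", 310]
])

# Unmarshal the slice and return the model keys for one page.
# An offset past the end of the slice yields an empty page.
def page_keys(slice_json, offset:, limit:)
  JSON.parse(slice_json).slice(offset, limit).to_a.map { |key, _score| key }
end

# Group keys by model name so each model can be fetched in a single query.
def ids_by_model(keys)
  keys
    .map { |key| key.split(":") }
    .group_by(&:first)
    .transform_values { |pairs| pairs.map { |_, id| Integer(id) } }
end

keys = page_keys(SLICE_JSON, offset: 1, limit: 2)
puts keys.inspect
puts ids_by_model(keys).inspect
```

Because the order lives in the slice, the records fetched for a page would need to be re-sorted to match the order of the keys before being handed to a presenter.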
Performance
We use a few strategies to keep the pages fast:
Precompute: Generate expensive resources in background jobs, triggered by underlying model changes, reducing lookup from O(n) to O(1).
Granular Caching: Model caching and view partial caching (based on “Russian Doll Caching”) improve the median response significantly.
Public caching: For common endpoints (the app is hosted on Heroku, behind Fastly).
All of these work together to keep the minimum amount of work in the critical render path while still providing the benefits of server-generated HTML.
What We Learned
There have been a few challenges, some easier to overcome than others:
Open Graph
We’ve learned just how inconsistent some publications are with Facebook metadata, since we pull headlines and descriptions from it. Likewise, anything that isn’t HTML (like PDFs, MP3s, etc.) stores metadata differently.
URL Canonicalization
We initially preferred the <link rel="canonical"> tag, though we’re now using the Facebook og:url first, followed by the <link rel="canonical"> tag, then the given URL. Without this step, people create duplicates of the same link, rendering the data much less useful.
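The preference order can be sketched as a simple fallback chain. This example assumes the metadata has already been extracted from the page, so no HTML parsing is shown. `usable_url?` and `canonical_url` are hypothetical helpers, and the rule that a candidate must be an absolute http(s) URL is an added assumption, not something described above.

```ruby
require "uri"

# Return true when the candidate looks like an absolute http(s) URL.
# (This validity rule is an assumption made for the sketch.)
def usable_url?(candidate)
  return false if candidate.nil? || candidate.strip.empty?

  uri = URI.parse(candidate.strip)
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

# Pick the canonical URL in order of preference:
# og:url first, then <link rel="canonical">, then the URL as given.
def canonical_url(given_url, og_url: nil, link_canonical: nil)
  [og_url, link_canonical, given_url].find { |candidate| usable_url?(candidate) }&.strip
end

puts canonical_url(
  "http://example.com/story?utm_source=twitter",
  og_url: "http://example.com/story",
  link_canonical: "http://example.com/story/"
)
```

Storing the result of a function like this as the lookup key would be one way to catch a duplicate before a second copy of the same story is saved.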
Similar Tags
Allowing tag creation company-wide has meant many duplicate tags (e.g., FBI and Federal Bureau of Investigation). We don’t want to simply roll them up because they would potentially be recreated. And they should be related, anyway. Because most stories have both, we will eventually be able to identify similar tags. We’re also interested in adding hierarchical structure to understand relationships between topics.
Dead Links
As troubling as it is, stories disappear. We periodically rescrape links and mark whether or not they return an HTTP error. Redirects are also a bit tricky, especially when they point to another link we do have. We haven’t quite figured this out yet.
Performance
It can always be better.
The Future
There are a lot of possibilities, but here are just a few we’ve been thinking about.
Engagement
What if other journalists or citizens could help us find criminal justice links? Is there added significance when a story has been submitted many times? Could we build games around which links people find? Could we let readers create their own newsletters?
Data
Measure The Record’s users and make product decisions based on that. What are people searching for? Clicking on? Can this be a signal for story popularity?
Distribution
What if we had a feed or a Twitter bot that tweeted each time a new link was added? What if you could add packages of links showing the reporting trail to the bottom of a big investigation?
Analysis
Could we build bots that look at the other content touching these links, like on Twitter or in other emails? What kind of visualizations could be created from what we’ve learned about the importance of topics over time, and the frequency of their appearance in media?
We hope this is helpful for others who are thinking about how they can use structured data or aggregation to improve the depth and breadth of their journalism.