Ouseful.wordpress.com

Data Journalism Units on Github

2017-01-25

Working as I do with an open notebook (this blog, my github repos, pinboard and twitter), I value works shared by other people too. Often, this can lead to iterative development, as one person sees an opportunity to use someone else’s work for a slightly different purpose, or spots a way to improve upon it.

A nice example of this that I witnessed in more or less realtime a few years ago was when data journalists from the Guardian and the Telegraph – two competing news outlets – bounced off each other’s work to produce complementary visualisations demonstrating electoral boundary changes (Data Journalists Engaging in Co-Innovation…). (By the by, boundary changes are up for review again in 2018 – the consultation is still open.)

Another example comes from when I started to look for cribs around building virtual machines to support OU course delivery; specifically, the Infinite Interns idea for distinct (and disposable) virtual machines that could be used to support data journalism projects (about).

Today, I variously chanced across a couple of Github repositories containing data, analyses, toolkits and application code from a couple of different news outlets. Github was originally developed as a social coding environment where developers could share and collaborate on software projects. But over the last few years, it’s also started to be used to share data and (text) documents, as well as reproducible data analyses – and not just by techies.

A couple of different factors have contributed to this, I think, that relate as much to how Github can be used to preview and publish documents as to how it acts as a version control and issue tracking system:

markdown documents (suffix .md) are rendered when files uploaded to a repository are viewed on Github (markdown is just text, with natural forms of emphasis, such as wrapping text you want to emphasise in *s, or prefixing list items written on separate lines with a - character, being rendered automatically);

Github will also render Jupyter notebook (.ipynb) files, and display .geojson files as embedded, interactive maps (and suggest map edits); (new to me, you can also view the differences between SVG images); this works equally well when you upload files to the gist.github.com scribblepad;

if you create a docs/ folder at the top level of a repository, you can put files into it that will be published via yourusername.github.io/yourRepository – or you can provide your own URL to serve that content (howto).
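To get a feel for the previewing and publishing behaviours listed above, it can help to have a few small files to hand. The following is a minimal sketch, not a recipe: the function name, the file names and the point coordinates are all made up for illustration, and it assumes that publishing from the docs/ folder has been switched on in the repository settings.

```python
import json
from pathlib import Path


def write_preview_files(repo_dir):
    """Write a markdown file, a GeoJSON file and a docs/ page under repo_dir.

    Returns the three paths so they can be checked or uploaded.
    """
    repo = Path(repo_dir)
    docs = repo / "docs"
    docs.mkdir(parents=True, exist_ok=True)

    # Plain text with markdown conventions: *emphasis* and "-" list items.
    markdown = "\n".join([
        "# Example notes",
        "",
        "Wrap text in asterisks for *emphasis*.",
        "",
        "- first list item",
        "- second list item",
        "",
    ])
    readme = repo / "README.md"
    readme.write_text(markdown, encoding="utf-8")

    # A single made-up point; GeoJSON coordinates are [longitude, latitude].
    feature_collection = {
        "type": "FeatureCollection",
        "features": [{
            "type": "Feature",
            "properties": {"name": "Example point (made up)"},
            "geometry": {"type": "Point", "coordinates": [-0.7, 52.0]},
        }],
    }
    geojson = repo / "example.geojson"
    geojson.write_text(json.dumps(feature_collection, indent=2), encoding="utf-8")

    # A page placed in docs/ so that it could be served as a published page.
    index = docs / "index.html"
    index.write_text(
        "<!DOCTYPE html>\n<html><body><p>Hello from the docs/ folder</p></body></html>\n",
        encoding="utf-8",
    )
    return readme, geojson, index
```

Calling write_preview_files("example-repo") creates the three files in a local folder, from where they can be uploaded through the web interface to see how each one is displayed.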

Admittedly, using git and Github can be really complicated and scary, but you can also use it as a place to pop documents and preview them or publish them, as described above. And getting files in is easy too – just upload them via the web interface.

Anyway, that’s all by the by… The point of this post was to try to pull together a small collection of links to some of the data journalism units I’ve spotted sharing stuff on Github, and see to what extent they practice “reproducible data journalism”. (There’s also a Github list – Github showcase – open journalism.) So for example:

one of the first news units I spotted sharing research in a reproducible way was BuzzFeedNews and their tennis betting analysis. A quick skim of several of the repos suggests they use a similar format – a boilerplate README with a link to the story, the data used in the analysis, and a Jupyter notebook containing python/pandas code to do the analysis. They also publish a handy directory to their repos, categorised as Data and Analyses, Standalone Datasets, Libraries and Tools, Guides… I’m liking this a lot…

fivethirtyeight: There are also a few other data related repos at the top level, eg guns-data. Hmm… Data but no reproducible analyses?

SRF Data – srfdata (data-driven journalism unit of Swiss Radio and TV): several repos containing Rmd scripts (in English) analysing election related data. More of this, please…

FT Interactive News – ft-interactive: separate code repos for different toolkits (such as their nightingale-charts chart tools) and applications; a lot of the applications seem to appear in subscriber only stories – but presumably you can try to download the code and run it yourself… Good for sharing code, but the paywall gets in the way of sharing executed examples;

New York Times – NYTimes: plenty of developer focussed repos, although the gunsales repo creates an R package that works with a preloaded dataset and routines to visualise the data and the ingredient phrase tagger is a natural language parser trained to tag food recipe components. (Makes me wonder what other sorts of trained taggers might be useful…) One for the devs…

Washington Post – washingtonpost: more devops repos, though they have also dropped a database of shootings (as a CSV file) as one of the repos (data-police-shootings). I’d hoped for more…

NYT Newsroom Developers: another developer focussed collection of repos, though rather than focussing on just front end tools there are also scrapers and API helpers. (It might actually be worth going through all the various news/media repos to build a metalist/collection of API wrappers, scrapers etc. i.e. tools for sourcing data). I’d half expected to see more here, too…?

Wall Street Journal Graphics Team – WSJ: not much here, but picking up on the previous point there is this example of an AP ballot API wrapper; Sparse…

The Times / Sunday Times – times: various repos, some of them link shares; the data one collects links to a few datasets and related stories. Also a bit sparse…

The Economist – economist-data-team: another unloved account – some old repos for interactive HTML applications; Another one for the devs, maybe…

BBC England Data Unit – BBC-Data-Unit: a collection of repositories, one per news project. Recent examples include: Dog Fights and Schools Chemical Alerts. Commits seem to come from a certain @paulbradshaw… Repos seem to include a data file and a chart image. How to run the analysis/create the chart from the data is not shared… Could do better…
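One crude way of making the "data but no reproducible analyses?" question a bit more systematic is to look at what sorts of file a repo contains. The following is only a sketch: it assumes a list of file paths has already been obtained somehow (for example, from a local clone), the sets of file suffixes are my own guess at what counts as data or analysis code, and the presence of a notebook says nothing about whether it actually runs.

```python
from pathlib import PurePosixPath

# Assumed suffix sets; adjust to taste.
DATA_SUFFIXES = {".csv", ".json", ".geojson", ".xlsx"}
ANALYSIS_SUFFIXES = {".ipynb", ".rmd", ".r", ".py"}


def summarise_repo(paths):
    """Report whether a list of repo file paths includes a README, data and analysis code."""
    suffixes = {PurePosixPath(p).suffix.lower() for p in paths}
    names = {PurePosixPath(p).name.lower() for p in paths}
    has_data = bool(suffixes & DATA_SUFFIXES)
    has_analysis = bool(suffixes & ANALYSIS_SUFFIXES)
    return {
        "readme": any(name.startswith("readme") for name in names),
        "data": has_data,
        "analysis": has_analysis,
        # Data plus analysis code is a hint, not proof, of reproducibility.
        "possibly_reproducible": has_data and has_analysis,
    }
```

For a hypothetical repo listing such as ["README.md", "data/matches.csv", "notebooks/analysis.ipynb"] every flag comes back True, whereas ["README.md", "data.csv", "chart.png"] is flagged as having data but no analysis.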

From that quick round up, a couple of impressions. Firstly, BuzzFeedNews seem to be doing some good stuff; the directory listing they use that breaks down different sorts of repos seems sensible, and could provide the basis for a more scholarly round up than the one presented here. Secondly, we could probably create some sort of matrix view over the various repos from different providers, that would allow us, for example, to see all the chart toolkits, or all the scrapers, or all the API wrappers, or all the election related stuff.
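As a first stab at what such a matrix view might look like, here is a minimal sketch using nothing more than nested dictionaries. The records are hand-tagged from the descriptions in the round up above, the category labels are my own invention, and a proper version would presumably be built from a much fuller crawl of each account.

```python
from collections import defaultdict

# (provider, category, item) records, hand-tagged from the round up above.
# The category labels are assumptions made for illustration.
RECORDS = [
    ("BuzzFeedNews", "analysis", "tennis betting analysis"),
    ("srfdata", "analysis", "election Rmd scripts"),
    ("ft-interactive", "chart toolkit", "nightingale-charts"),
    ("fivethirtyeight", "dataset", "guns-data"),
    ("washingtonpost", "dataset", "data-police-shootings"),
    ("WSJ", "API wrapper", "AP ballot API wrapper"),
]


def build_matrix(records):
    """Group items into a category -> provider -> [items] lookup."""
    matrix = defaultdict(lambda: defaultdict(list))
    for provider, category, item in records:
        matrix[category][provider].append(item)
    return matrix


def items_for(matrix, category):
    """Return {provider: [items]} for one category, or {} if nobody has any."""
    return {
        provider: list(items)
        for provider, items in matrix.get(category, {}).items()
    }


def render(matrix):
    """Format the matrix as plain text, one category per block."""
    lines = []
    for category in sorted(matrix):
        lines.append(category)
        for provider in sorted(matrix[category]):
            lines.append(f"  {provider}: {', '.join(matrix[category][provider])}")
    return "\n".join(lines)
```

With the matrix built via build_matrix(RECORDS), a call such as items_for(matrix, "dataset") pulls out everything tagged as a dataset, grouped by provider, and print(render(matrix)) gives a quick text listing of the whole thing.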

If you know of any more I should add to the list, please let me know via the comments below, ideally with a one or two line summary as per the above to give a flavour of what’s there…

I’m also mindful that a lot of people working for the various groups may also be publishing to personal repositories. If you want to suggest names for a round up of those, again, please do so via the comments.

PS I should really be recording the licenses that folk are releasing stuff under too…