(This post is an aspirational transcript of the talk I gave to the Data Terra Nemo conference in May 2015. If you’d like to watch the less eloquent version of the same talk that I actually gave, the video should be available soon!)
I’ve been working on building a decentralized GitHub, and I’d like to talk about what this means and why it matters — and more importantly, show you how it can be done and real GitTorrent code I’ve implemented so far.
Why a decentralized GitHub?
First, the practical reasons: GitHub might become untrustworthy, get hacked — or get DDOS’d by China, as happened while I was working on this project! I know GitHub seems to be doing many things right at the moment, but there often comes a point at which companies that have raised $100M in Venture Capital funding start making decisions that their users would strongly prefer them not to.
There are philosophical reasons, too: GitHub is closed source, so we can’t make it better ourselves. Mako Hill has an essay called Free Software Needs Free Tools, which describes the problems with depending on proprietary software to produce free software, and I think he’s right. To look at it another way: the experience of our collaboration around open source projects is currently being defined by the unmodifiable tools that GitHub has decided that we should use.
So that’s the practical and philosophical, and I guess I’ll call the third reason the “ironical”. It is a massive irony to move from many servers running the CVS and Subversion protocols, to a single centralized server speaking the decentralized Git protocol. Google Code announced its shutdown a few months ago, and their rationale was explicitly along the lines of “everyone’s using GitHub anyway, so we don’t need to exist anymore”. We’re quickly heading towards a single central service for all of the world’s source code.
So, especially at this conference, I expect you’ll agree with me that this level of centralization is unwise.
Isn’t Git already decentralized?
You might be thinking that while GitHub is centralized, the Git protocol is decentralized — when you clone a repository, your copy is as good as anyone else’s. Isn’t that enough?
I don’t think so, and to explain why I’d like you to imagine someone arguing that we can do without BitTorrent because we have FTP. We would not advocate replacing BitTorrent with FTP, and the suggestion doesn’t even make sense! First — there’s no index of which hosts have which files in FTP, so we wouldn’t know where to look for anything. And second — even if we knew who owned copies of the file we wanted, those computers aren’t going to be running an anonymous FTP server.
Just like Git, FTP doesn’t turn clients into servers in the way that a peer-to-peer protocol does. So that’s why Git isn’t already the decentralized GitHub — you don’t know where anything’s stored, and even if you did, those machines aren’t running Git servers that you’re allowed to talk to. I think we can fix that.
Let’s GitTorrent a repo!
Let’s jump in with a demo of GitTorrent – that is, cloning a Git repository that’s hosted on BitTorrent:
Hey everyone: we just cloned a git repository over BitTorrent! So, let’s go through this line by line.
Lines 1-2: Git actually has an extensible mechanism for network protocols built in. The way it works is that my git clone line gets turned into “run the git-remote-gittorrent command and give it the URL as an argument”. So we can do whatever we want to perform the actual download, and we’re responsible for writing git objects into the new directory and telling Git when we’re done, and we didn’t have to modify Git at all to make this work.
So git-remote-gittorrent takes it from here. First we connect to GitHub to find out what the latest revision for this repository is, so that we know what we want to get. GitHub tells us it’s 5fbfea8de...
Lines 4-6: Then we go out to the GitTorrent network, which is a distributed hash table just like BitTorrent’s, and ask if anyone has a copy of commit 5fbdea8de... Someone said yes! We make a BitTorrent connection to them. The way that BitTorrent’s distributed hash table works is that there’s a single operation, get_nodes(hash) which tells you who can send you content that you want, like this:
Now, in standard BitTorrent with “trackerless torrents”, you ask for the files that you want by their content, and you’d get them and be happy. But a repository the size of the Linux kernel has four million commits, so just receiving the one commit 5fbdea8de.. wouldn’t be helpful; we’d have to make another four million requests for all the other commits too. Nor do we want to get every commit in the repository every time we ‘git pull’. So we have to do something else.
Lines 8-12: Git has solved this problem — it has this “smart protocol format” for negotiating an exchange of git objects. We can think of it this way:
Imagine that your repository has 20 commits, 1-20. And the 15th commit is bbbb and the most recent 20th commit is aaaa. The Git protocol negotiation would look like this:
Because of the way the git graph works, node 1> here can look up where bbbb is on the graph, see that you’re only asking for five commits, and create you a “packfile” with just those objects. Just by a three-step communication.
That’s what we’re doing here with GitTorrent. We ask for the commit we want and connect to a node with BitTorrent, but once connected we conduct this Smart Protocol negotiation in an overlay connection on top of the BitTorrent wire protocol, in what’s called a BitTorrent Extension. Then the remote node makes us a packfile and tells us the hash of that packfile, and then we start downloading that packfile from it and any other nodes who are seeding it using Standard BitTorrent. We can authenticate the packfile we receive, because after we uncompress it we know which Git commit our graph is supposed to end up at; if we don’t end up there, the other node lied to us, and we should try talking to someone else instead.
So that’s what just happened in this terminal. We got a packfile made for us with this hash — and it’s one that includes every object because this is a fresh clone — we downloaded and unpacked it, and now we have a local git repository.
This was a git clone where everything up to the actual downloading of git objects happened as it would in the normal GitHub way. If GitHub decided tomorrow that it’s sick of being in the disks and bandwidth business, it could encourage its users to run this version of GitTorrent, and it would be like having a peer to peer “content delivery network” for GitHub, falling back to using GitHub’s servers in the case where the commits you want aren’t already present in the CDN.
Was that actually decentralized?
That’s some progress, but you’ll have noticed that the very first thing we did was talk to GitHub to find out which hash we were ultimately aiming for. If we’re really trying to decentralize GitHub, we’ll need to do much better than that, which means we need some way for the owner of a repository to let us know what the hash of the latest version of that repository is. In short, we now have a global database of git objects that we can download, but now we need to know what objects we want — we need to emulate the part of github where you go to /user/repo, and you know that you’re receiving the very latest version of that user’s repo.
So, let’s do better. When all you have is a hammer, everything looks like a nail, and my hammer is this distributed hash table we just built to keep track of which nodes have which commits. Very recently, substack noticed that there’s a BitTorrent extension for making each node be partly responsible for maintaining a network-wide key-value store, and he coded it up. It adds two more operations to the DHT, get() and put(), and put() gives you 1000 bytes per key to place a message into the network that can be looked up later, with your answer repeated by other nodes after you’ve left the network. There are two types of key — the first is immutable keys, which work as you might expect, you just take the hash of the data you want to store, and your data is stored with that hash as the key.
The second type of key is a mutable key, and in this case the key you look up is the hash of a public key to a crypto keypair, and the owner of that keypair can publish signed updates as values under that key. Updates come with a sequence number, so anytime a client sees an update for a mutable key, it checks if the update has a newer sequence number than the value it’s currently recorded, and it checks if the update is signed by the public key corresponding to the hash table key, which proves that the update came from the key’s owner. If both of those things are true then it’ll update to this newer value and start redistributing it. This has many possible uses, but my use for it is as the place to store what your repositories are called and what their latest revision is. So you’d make a local Git commit, push it to the network, and push an update to your personal mutable key that reflects that there’s a new latest commit. Here’s a code description of the new operations:
So now if I want to tell someone to clone my GitHub repo on GitTorrent, I don’t give them the github.com URL, instead I give them this long hex number that is the hash of my public key, which is used as a mutable key on the distributed hash table.
Here’s a demo of that:
In this demo we again cloned a Git repository over BitTorrent, but we didn’t need to talk to GitHub at all, because we found out what commit we were aiming for by asking our distributed hash table instead. Now we’ve got true decentralization for our Git downloads!
There’s one final dissatisfaction here, which is that long strings of hex digits do not make convenient usernames. We’ve actually reached the limits of what we can achieve with our trusty distributed hash table, because usernames are rivalrous, meaning that two different people could submit updates claiming ownership of the same username, and we wouldn’t have any way to resolve their argument. We need a method of “distributed consensus” to give out usernames and know who their owners are. The method I find most promising is actually Bitcoin’s blockchain — the shared consensus that makes this cryptocurrency possible.
The deal is that there’s a certain type of Bitcoin transaction, called an OP_RETURN transaction, that instead of transferring money from one wallet to another, leaves a comment as your transaction that gets embedded in the blockchain forever. Until recently you were limited to 40 bytes of comment per transaction, and it’s been raised to 80 bytes per transaction as of Bitcoin Core 0.11. Making any Bitcoin transaction on the blockchain I believe currently costs around $0.08 US cents, so you pay your 8 cents to the miners and the network in compensation for polluting the blockchain with your 80 bytes of data.
If we can leave comments on the blockchain, then we can leave a comment saying “Hey, I’d like the username Chris, and the hash of my public key is
“, and if multiple people ask for the same username, this time we’ll all agree on which public key asked for it first, because blockchains are an append-only data structure where everyone can see the full history. That’s the real beauty of Bitcoin — this currency stuff is frankly kind of uninteresting to me, but they figured out how to solve distributed consensus in a robust way. So the comment in the transaction might be:
It’s interesting, though — maybe that “gittorrent” at the beginning doesn’t have to be there at all. Maybe this could be a way to register one username for every site that’s interested in decentralized user accounts with Bitcoin, and then you’d already own that username on all of them. This could be a separate module, a separate software project, that you drop in to your decentralized app to get user accounts that Just Work, in Python or Node or Go or whatever you’re writing software in. Maybe the app would monitor the blockchain and write to a database table, and then there’d be a plugin for web and network service frameworks that knows how to understand the contents of that table.
It surprised me that nothing like this seems to exist already in the decentralization community. I’d be happy to work on a project like this and make GitTorrent sit on top of it, so please let me know if you’re interested in helping with that.
By the way, username registration becomes a little more complicated than I just said, because the miners could see your message, and decide to modify it before adding it to the blockchain, as a registration of your username to them instead of you. This is the equivalent of going to a domain name registrar and typing the domain you want in their search box to see if it’s available — and at that moment of your search the registrar could turn around and register it for themselves, and then tell you to pay them a thousand bucks to give it to you. It’s no good.
If you care about avoiding this, Bitcoin has a way around it, and it works by making registration a two-step process. Your first message would be asking to reserve a username by supplying just the hash of that username. The miners don’t know from the hash what the username is so they can’t beat you to registering it, and once you see that your reservation’s been included in the blockchain and that no-one else got a reservation in first, you can send on a second comment that says “okay, now I want to use my reservation token, and here’s the plain text of that username that I reserved”. Then it’s yours.
(I didn’t invent this scheme. There’s a project called Blockname, from Jeremie Miller, that works in exactly this way, using Bitcoin’s OP_RETURN transaction for DNS registrations on bitcoin’s blockchain. The only difference is that Blockname is performing domain name registrations, and I’m performing a mapping from usernames to hashes of public keys.)
So to wrap up, we’ve created a global BitTorrent swarm of Git objects, and worked on user account registration so that we can go from a user experience that looks like this:
to this:
to this:
And at this point I think we’ve arrived at a decentralized replacement for the core feature of GitHub: finding and downloading Git repositories.
Closing thoughts
There’s still plenty more to do — for example, this doesn’t do anything with comments or issues or pull requests, which are all very important aspects of GitHub.
For issues, the solution I like is actually storing issues in files inside the code repository, which gives you nice properties like merging a branch means applying both the code changes and the issue changes — such as resolving an issue — on that branch. One implementation of this idea is Bugs Everywhere.
We could also imagine issues and pull requests living on Secure Scuttlebutt, which synchronizes append-only message streams across decentralized networks.
I’m happy just to have got this far, though, and I’d love to hear your comments on this design. The design of GitTorrent itself is (ironically enough) on GitHub and I’d welcome pull requests to make any aspect of it better.
I’d like to say a few thank yous — first to Feross Aboukhadijeh, who wrote the BitTorrent libraries that I’m using here. Feross’s enthusiasm for peer-to-peer and the way that he runs community around his “mad science” projects made me feel excited and welcome to contribute, and that’s part of why I ended up working on this project.
I’m also able to work on this because I’m taking time off from work at the moment to attend the Recurse Center in New York City. This is the place that used to be called “Hacker School” and it changed its name recently; the first reason for the name change was that they wanted to get away from the connotations of a school where people are taught things, when it’s really more like a retreat for programmers to improve their programming through project work for three months, and I’m very thankful to them for allowing me to attend.
The second reason they decided to change their name because their international attendees kept showing up at the US border and saying “I’m here for Hacker School!” and.. they didn’t have a good time.
Finally, I’d like to end with a few more words about why I think this type of work is interesting and important. There’s a certain grand, global scale of project, let’s pick GitHub and Wikipedia as exemplars, where the only way to have the project be able to exist at global scale after it becomes popular is to raise tens of millions of dollars a year, as GitHub and Wikipedia have, to spend running it, hoarding disks and bandwidth in big data centers. That limits the kind of projects we can create and imagine at that scale to those that we can make a business plan for raising tens of millions of dollars a year to run. I hope that having decentralized and peer to peer algorithms allows us to think about creating ambitious software that doesn’t require that level of investment, and just instead requires its users to cooperate and share with each other.
Thank you all very much for listening.