npm has only been a company for 3 years, but it has been a code base for around 5–6 years. Much of it has been rewritten, but the cores of the CLI and registry are still the original code. Having only worked at npm for a year at this point, I still have a lot to learn about how the whole system works.
Sometimes a user files a bug that, in the process of debugging it, teaches you things you didn’t know about your own system. This is the story of one of those bugs.
The Bug
Over the past week or so, several people filed issues about some strange truncation on npm package pages. In one issue, a user reported what appeared to be a broken link in their README.
Another user pointed out that the entire end portion of their README was missing!
As a maintainer of npm’s markdown parser, marky-markdown, I was concerned that these issues were a result of a parsing rule gone awry. However, another marky-markdown maintainer, @revin, quickly noted something odd: the description was cut off at exactly 255 characters, and the README was cut off at exactly 64kb. As my colleague @aredridel pointed out: those numbers are smoking guns.
Indeed, an internal npm service called registry-relational-follower was truncating both the READMEs and descriptions of packages published to the npm registry. This was a surprise to me and my colleagues, so I filed an issue on our public registry repo. In almost no time at all, our CTO @ceejbot responded by saying that this truncation was intended behavior(!) and closed the issue.
“TIL!” I thought. And that’s when I decided to dig into how the registry handles READMEs… and why.
The Zero One Infinity Rule
Before I dive into exactly what happens to your packages’ READMEs between the time you write and publish them and the time they are rendered on the npm website, let’s address the 800-lb gorilla in the room:
When I discovered that the registry was arbitrarily truncating READMEs, I thought: “Seems bad.”
Maybe you thought this, too.
Indeed, at least one other person did, commenting on the closed issue:
This may be desired by npm, but I doubt any package authors desire their descriptions to be truncated.
Also, see zero-one-infinity.
I should point out that commenting negatively on an already closed issue isn’t the best move in the world. However, I appreciated this comment, because it gave me new words to explain my own vaguely negative feelings about this truncation situation — fancy words with a nice name: The Zero One Infinity rule.
The Zero One Infinity rule is a guiding principle made popular by Dutch computer scientist Willem van der Poel and goes as follows:
Allow none of foo, one of foo, or any number of foo.
—Jargon File
This principle aims to eliminate arbitrary restrictions of any kind. Functionally, it suggests that, if you are going to allow something at all, you should allow either exactly one of it or an unlimited number of it. It pairs naturally with a closely related rule, the Principle of Least Astonishment, which states:
If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature.
In the end, these principles are fancy, important-sounding ways of saying: arbitrary restrictions are surprising, and we shouldn’t be surprising our users.
Now that we can agree that surprising users with strange and seemingly arbitrary restrictions is no bueno … why does the npm registry currently have this restriction? Certainly npm’s developers don’t want to be surprising developers, right?
An Archaeology of Registry Architecture
Indeed, they don’t! The current restriction on description and README size is a Band-Aid that npm’s registry developers were forced to apply as a result of the original architecture of the npm registry: large READMEs were making npm slow.
How the heck…, you might be thinking. Reasonable. Let’s take a look.
How npm Deals with READMEs on Publish
Currently, here is how your READMEs are dealt with by the registry:
When you type npm publish, the CLI tool takes a look at your .npmignore (or your .gitignore, if no .npmignore is present) and the files key of your package.json. Based on what it finds there, the CLI takes the files you intend to publish and runs npm pack, which packs everything up into a tarball, or .tar.gz file. npm never lets you ignore the README file, so that gets packed up no matter what!
When you type npm publish, your README gets packed into a package tarball. This is what gets downloaded when someone npm installs your package. But this is not the only thing that happens with your README.
So while npm publish runs npm pack, it also runs a script called publish.js that builds an object containing the package’s metadata. Over the course of your package’s life (as you publish new versions), this metadata grows. First, read-package-json is run and grabs the content of your README file based on what you’ve listed in your package.json. Then publish.js adds this README data to the metadata for your package. You can think of this metadata as a more verbose version of your package.json. If you ever want to check out what it looks like, you can go to http://registry.npmjs.com/ followed by the name of a package. For example, check out http://registry.npmjs.com/marky-markdown. As you’ll see, there’s README data in there for whichever version of your package has the latest tag!
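To make the shape of that metadata a little more concrete, here is a minimal sketch in JavaScript. The document in it is a hypothetical, heavily trimmed stand-in for what the registry returns (real documents carry many more fields), and `summarizeMetadata` is a helper invented for this illustration, not part of npm. The field names `dist-tags`, `versions`, and `readme` follow the layout of the registry's JSON as far as I understand it.

```javascript
// Hypothetical, heavily trimmed package metadata document.
// Real registry responses contain many more fields per version.
const exampleMetadata = {
  name: "example-package",
  "dist-tags": { latest: "1.1.0" },
  versions: {
    "1.0.0": { name: "example-package", version: "1.0.0" },
    "1.1.0": { name: "example-package", version: "1.1.0" },
  },
  // A single README, taken from the version tagged `latest`.
  readme: "# example-package\n\nInstall it, then read this README.\n",
};

// Illustrative helper (not an npm API): report what the document holds.
function summarizeMetadata(doc) {
  const readme = doc.readme || "";
  return {
    name: doc.name,
    latest: doc["dist-tags"].latest,
    versionCount: Object.keys(doc.versions).length,
    // Size of the README on its own, in UTF-8 bytes.
    readmeBytes: Buffer.byteLength(readme, "utf8"),
    // Size of the whole serialized document, README included.
    totalBytes: Buffer.byteLength(JSON.stringify(doc), "utf8"),
  };
}

console.log(summarizeMetadata(exampleMetadata));
```

The point to notice is that the README lives inside the same document as everything else, so its size counts toward every response that returns the metadata.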
Finally, publish.js sends this metadata, including your README, to validate-and-store… and here is where we bump into our truncation situation.
npm publish sends the entire README data to the registry, but the entire README does not get written to the database. Instead, when the database receives the README, it truncates it at 64kb before inserting.
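Here is a rough sketch of what a truncation step like this could look like. It is not the code from registry-relational-follower. The function names are made up, and I am assuming that "64kb" means 64 * 1024 bytes of UTF-8 and that the 255-character description limit is applied as a plain string slice.

```javascript
// Assumed limits, based on the numbers observed in the bug reports.
const README_LIMIT_BYTES = 64 * 1024;
const DESCRIPTION_LIMIT_CHARS = 255;

// Cut a string down to at most `maxBytes` of UTF-8 without splitting
// a multi-byte character in half.
function truncateUtf8(text, maxBytes) {
  const buf = Buffer.from(text, "utf8");
  if (buf.length <= maxBytes) return text;
  let end = maxBytes;
  // If the first dropped byte is a continuation byte (10xxxxxx), the cut
  // falls inside a character, so back up to the start of that character.
  while (end > 0 && (buf[end] & 0xc0) === 0x80) end--;
  return buf.subarray(0, end).toString("utf8");
}

// Hypothetical storage step: returns a copy with both fields capped.
function truncateForStorage(pkg) {
  return {
    ...pkg,
    // Note: slice() counts UTF-16 code units, not user-visible characters.
    description: (pkg.description || "").slice(0, DESCRIPTION_LIMIT_CHARS),
    readme: truncateUtf8(pkg.readme || "", README_LIMIT_BYTES),
  };
}

const stored = truncateForStorage({
  name: "example-package",
  description: "d".repeat(300),
  readme: "r".repeat(70000),
});
console.log(stored.description.length, Buffer.byteLength(stored.readme, "utf8"));
```

A cut like this has no idea what markdown it is slicing through, which is how a link near the limit can end up looking broken on the rendered page.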
This means: while we talk about a package on the npm registry as a single entity, the truth is that a single package is actually made up of multiple components that are handled differently by the npm registry services. Notably, there’s one service for tarballs, and another for metadata, and your README is added to both.
As a result, the registry has two versions of your README:
- The original version as a file in the package tarball
- A potentially truncated version in the package metadata
As you may now be guessing, users have been seeing truncated READMEs on the npm website because the npm website uses the README data from package metadata. This makes a fair amount of sense: if we wanted to use the READMEs in the package tarballs, we’d have to unpack every package tarball to retrieve the README, and that would not be super efficient. Reading README data from a JSON response, which is how the npm registry serves package metadata, seems at least a little more reasonable than unpacking over 350,000 tarballs.
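Since the two copies can diverge, one way to check whether a package was affected is to compare them. This is a minimal sketch, assuming you have already extracted the README text from the tarball and read the README string out of the metadata. The helper name and the sample strings are invented for illustration, and it assumes that truncation leaves a strict prefix of the original.

```javascript
// Illustrative helper (not an npm API): compare the README shipped in the
// tarball with the README stored in the package metadata.
function compareReadmeCopies(tarballReadme, metadataReadme) {
  if (tarballReadme === metadataReadme) {
    return { status: "identical", missingBytes: 0 };
  }
  if (tarballReadme.startsWith(metadataReadme)) {
    // The metadata copy is a prefix of the tarball copy: it was cut short.
    const missingBytes =
      Buffer.byteLength(tarballReadme, "utf8") -
      Buffer.byteLength(metadataReadme, "utf8");
    return { status: "truncated", missingBytes };
  }
  // The copies differ in some other way, e.g. they come from different versions.
  return { status: "different", missingBytes: null };
}

// Made-up sample data standing in for the two copies.
const fromTarball = "# example-package\n" + "Lots of documentation. ".repeat(10);
const fromMetadata = fromTarball.slice(0, 60);
console.log(compareReadmeCopies(fromTarball, fromMetadata));
```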
History Lesson Time
So now we know where the READMEs are truncated, and how those truncated READMEs are used — but it’s still not necessarily clear why. Understanding this requires a bit of archaeology.
Like many things about npm, this truncation was not always the case. On January 20, 2014, @isaacs committed the 64kb README truncation to npm-registry-couchapp, and he had several very good reasons for doing so:
First, allowing extremely large READMEs exposed us to a potential DDoS attack. An unsavory actor could automate publishing several packages with epically large READMEs and take down a bunch of npm’s infrastructure.
Second, extremely large READMEs in the package metadata were exploding the file size of that document, which made GET requests to retrieve package data very slow. Requesting the package metadata happens for every package on an npm install, so ostensibly a single npm install could be gummed up by having to read several packages with very long READMEs. Those READMEs wouldn’t even be useful to the end user, who would either use the unpacked README from the tarball or not need the README at all if, for example, the package was a transitive dependency far down in the dependency tree.
Interestingly enough, the predicament of exploding document size was a problem that npm had dealt with before.
Remember when we pointed out that a single package is actually a set of data managed by several different services? Like many things at npm, this also was not always the case.
Originally, npm’s registry was entirely contained by a single service, a CouchApp, on top of a CouchDB database. CouchDB is a database that uses JSON for documents, JavaScript for MapReduce indexes, and regular HTTP for its API.
CouchDB comes with out-of-the-box functionality called CouchApp: a web application served directly from CouchDB. npm’s registry was originally exclusively a CouchApp: packages were single, document-based entities with the tarballs as attachments on the documents. The simplicity of this architecture made it easy to work with and maintain, i.e., a totally reasonable version 1.
Soon after that, though, npm began to grow extremely quickly — package publishes and downloads exploded — and the original architecture scaled poorly. As packages grew in size and number, and dependency trees grew in length and complexity, performance ground to a halt and npm’s registry would crash often. This was a period of intense growing pains for npm.
To mitigate this situation, @isaacs split the registry into two pieces: a registry that had only metadata (attachments were moved to an object store called Manta and removed from CouchDB), which he called skim, and another registry, called full-fat, that contained both the metadata and the tarball attachments. This split was the first of what would be multiple (and ongoing!) refactoring efforts to reduce the size of package metadata documents and to distribute the processing of packages across multiple services to improve performance.
If you look at the npm registry architecture today, you’ll see the effects of our now CTO @ceejbot’s effort to continue to split the monolith: slowly separating out registry functionality into multiple smaller services, some of which are no longer backed by the original CouchDB but by Postgres instead.
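As a rough illustration of that split, here is a sketch that separates a CouchDB-style document into a metadata-only part and its attachments. `_attachments` is the field CouchDB uses for attachments on a document. Everything else (the sample document, `splitFullFat`, and the `storageKey` naming) is hypothetical and is not how skim or full-fat were actually implemented.

```javascript
// Hypothetical "full-fat" style document: metadata plus inline attachments.
const fullFatDoc = {
  _id: "example-package",
  name: "example-package",
  "dist-tags": { latest: "1.0.0" },
  versions: { "1.0.0": { name: "example-package", version: "1.0.0" } },
  _attachments: {
    "example-package-1.0.0.tgz": {
      content_type: "application/octet-stream",
      // Stand-in for the base64-encoded tarball bytes.
      data: "A".repeat(4096),
    },
  },
};

// Separate the metadata ("skim") from the tarball attachments, which could
// then be written to an object store instead of the database.
function splitFullFat(doc) {
  const { _attachments = {}, ...skim } = doc;
  const tarballs = Object.entries(_attachments).map(([filename, attachment]) => ({
    storageKey: `${doc.name}/${filename}`, // hypothetical object-store key
    ...attachment,
  }));
  return { skim, tarballs };
}

const { skim, tarballs } = splitFullFat(fullFatDoc);
console.log("full-fat size:", JSON.stringify(fullFatDoc).length);
console.log("skim size:", JSON.stringify(skim).length);
console.log("tarballs moved out:", tarballs.length);
```

The same idea, pulling a bulky part of the document out into its own store, is what the README discussion comes back to.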
Plans for the Future
Turns out that nobody thinks that arbitrarily restricting README length is a good thing. There are plans in the works for a registry version 3, and changing up the README lifecycle is definitely in the cards. Much like the original shift that @isaacs made when he created the skim and full-fat registry services, the team would ideally like to see README data removed from the package metadata document and moved to a service that can render them and serve them statically to the website. This would bring several awesome benefits:
- No more README truncating! Good-bye arbitrary restrictions!
- Speeding up the website by moving markdown parsing to its own service.
- Speeding up the website even more by pre-parsing READMEs and serving them statically instead of parsing them on request. (Yes we cache, but still…)
- Serving READMEs for all versions of a package! By lowering the cost of READMEs, we can not only parse more of a single README, but parse more READMEs too! :)
npm cares deeply about backwards compatibility, so all of the original endpoints and functionality of our original API will continue to be supported as the npm registry grows out of its CouchApp and CouchDB origins. This means there will always be a service where you can request a package’s metadata and get the README for the latest version. However, npm itself doesn’t have to use that service. Moving on from it towards our vision of registry version 3 will be an awesome improvement, across several axes.
Happy Debugging!
A friend recently tweeted:
systems as designed are great, but systems as found are awful
This is not a shot at npm; this statement is pretty ubiquitously true. Most systems that are of any interest to anyone are the products of a long and likely complicated history of constraints and motivations, and such circumstances often produce strange results. As displeasing as the systems you find might be, there is still a pleasure in finding out how a system “works” (for certain values of “work,” of course).
In the end, the “fix” for the “bug” was “we’ve got a plan for that, but it’s gonna take a while.” That isn’t all that satisfying. However, the process of tracking down a seemingly simple element of the npm registry system and exploring it across services and time was extremely rewarding.
In fact, in the process of writing this post I became aware that Crates.io, the website for the Rust Programming Language’s package manager Cargo, was dealing with a very similar situation regarding their package READMEs. Instead of trying to remove them from their package metadata like us, they’re considering putting them in! If I hadn’t had the opportunity to dig around in the internals of npm’s registry, I might not have been ready to offer them suggestions with the strength of 5 years of experience.
So — the moral of the story is this: When you can, take the time to dig through the caves of your own software and ask questions about past decisions and lessons. Then, write down what you learn. It might be helpful one day, and probably sooner than you think.