2017-03-09

I've got a medium-sized project that's nearing the end of the "sloppy caffeine-powered prototypes for client demos" phase and transitioning into the "think about the future" phase. The project consists of Linux-based devices with software and firmware, and a central administrative web server. 10 prototypes currently exist; production is expected to be in the low thousands.

Not being well-versed in the art of auto-updates, and being short on time, I had quickly rolled my own software deployment / auto-update strategy and, frankly, it sucks. It currently consists of the following:

A hosted git repo (GitLab) with a production release branch (note the web server source is also in this same repo, as well as a few other things).

A "deploy update" button on the web interface that:

Pulls the latest version from the production release branch into a local repo area and also copies it to a temporary package prep staging area.

Runs a sanitization script (stored in the repo) in the staging area to remove unrelated source files (e.g. server source, firmware source, etc.) and .git files.

Writes the current git hash to a file in the update package (purpose will become clear below).

If all went well, it gzips it and makes it ready to serve by overwriting the previous gzipped package with a file of the same name, then deletes the staging area.

Note that there are now two copies of the current device software on the server, which are expected to be in sync: A full local git repo on the latest production branch, and a ready-to-go gzipped package that is now assumed to represent that same version.
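To make the packaging step above concrete, here is a minimal Python sketch, not the actual server code. The function name build_package and the hash file name VERSION are hypothetical (the post does not name either). The temp-file-then-rename step is an assumption about one way to do the "overwrite the previous package" step so that a failed build never leaves a half-written package behind.

```python
import os
import tarfile
import tempfile


def build_package(staging_dir: str, commit_hash: str, package_path: str) -> None:
    """Record the commit hash, gzip the staging area, and swap the package in."""
    # Hypothetical file name: the hash file shipped inside the update package.
    with open(os.path.join(staging_dir, "VERSION"), "w") as f:
        f.write(commit_hash + "\n")

    # Build the archive under a temporary name next to the final package,
    # so the rename below stays on the same filesystem.
    package_dir = os.path.dirname(os.path.abspath(package_path))
    fd, tmp_path = tempfile.mkstemp(dir=package_dir, suffix=".tmp")
    os.close(fd)
    try:
        with tarfile.open(tmp_path, "w:gz") as tar:
            for name in sorted(os.listdir(staging_dir)):
                tar.add(os.path.join(staging_dir, name), arcname=name)
        # Replace the old package in a single step (assumption, see above).
        os.replace(tmp_path, package_path)
    finally:
        # Only present if something failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
```

Because the hash travels inside the archive, the served package can always be asked which version it contains, independently of the state of the local git repo.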

Software on the device is self-contained in a directory named /opt/example/current, which is a symlink to the current version of the software.

An auto-update function on the device that, on boot:

Checks for the presence of a do_not_update file and takes no further action if it exists (for dev devices, see below).

Reads the current commit hash from the above mentioned text file.

Makes an HTTP request to the server with that hash as a query parameter. The server will either respond with a 304 (hash is current version) or will serve the gzipped update package.

Installs the update package, if one was received, into /opt/example by:

Extracting the updated software into a folder named stage.

Running a post installation script from the update package that does things like make necessary local changes for that update, etc.

Copying the current software root folder to previous (deletes existing previous first, if there is one).

Copying the stage folder to latest (deletes existing latest first, if there is one).

Ensuring the current symlink points to latest.

Rebooting the device (firmware updates, if present, are applied on reboot).
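The install steps above can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the device's real updater: the function name install_update is hypothetical, the post-installation script and the reboot are left out, and repointing the symlink via a temporary link plus rename is an assumption about how "ensuring" the link could be done in one step.

```python
import os
import shutil
import tarfile


def install_update(package_path: str, root: str) -> None:
    """Install a gzipped update package into root (e.g. /opt/example)."""
    stage = os.path.join(root, "stage")
    previous = os.path.join(root, "previous")
    latest = os.path.join(root, "latest")
    current = os.path.join(root, "current")

    # Extract the package into a fresh stage folder.
    shutil.rmtree(stage, ignore_errors=True)
    os.makedirs(stage)
    with tarfile.open(package_path, "r:gz") as tar:
        tar.extractall(stage)

    # (The post-installation script from the package would run here.)

    # Copy whatever current points at to previous, replacing any old copy.
    if os.path.exists(current):
        shutil.rmtree(previous, ignore_errors=True)
        shutil.copytree(os.path.realpath(current), previous)

    # Copy stage to latest, replacing any old copy.
    shutil.rmtree(latest, ignore_errors=True)
    shutil.copytree(stage, latest)

    # Repoint current at latest: create the new link under a temporary
    # name, then rename it over the old link (assumption, see above).
    tmp_link = current + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink("latest", tmp_link)
    os.replace(tmp_link, current)
```

Note that if current already points to latest, there is a window between deleting latest and recreating it during which current dangles, so a power loss in that window leaves the device without a runnable current.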

There is also the issue of initial deployment on newly constructed devices. The devices are currently SD card based (which has its own set of problems, out of scope here), so this process consists of:

An SD image exists that has some stable earlier version of the software on it.

An SD card is created from this image.

On first boot various first-time device-specific (serial number based) initialization takes place and then the auto-updater grabs and installs the latest production version of the software as per usual.

Additionally I needed support for development devices. For development devices:

A full local git repo is maintained on the device.

The current symlink points to the development directory.

A local do_not_update file exists which prevents the auto-updater from blowing away development code with a production update.

Now, the deployment process was theoretically intended to be:

Once code is ready for deployment push it to the release branch.

Press the "deploy update" button on the server.

The update is now live and devices will auto-update the next time they check.

However there are a ton of problems in practice:

The web server code is in the same repo as the device code, and the server has a local git repo that I execute out of. The latest web server code is not on the same branch as the latest device code, and the directory structure is problematic: when the "deploy update" button pulls the latest version from the production branch, it pulls it into a subdirectory of the server code. This means that when I deploy to a server from scratch, I have to manually "seed" this subdirectory by grabbing the device production branch into it. If I don't, the deployment attempts to pull the device code from the parent directory's web server branch (probably git user error on my part). I think this is solvable by making the staging area not be a subdirectory of the server's local git repo.

The web server currently does not store the git hash of the device software persistently. On server startup it does a git rev-parse HEAD in its local device software repo to retrieve the current hash. For reasons I can't wrap my head around, this is also causing a ton of logic errors that I won't describe here. Suffice it to say that restarting the server sometimes screws things up, especially if the server is brand new and no production branch repo has been pulled yet. I'd happily share the source for that logic if requested, but this post is getting long.

If the sanitization script (server side) fails for some reason, the server is left with an up-to-date repo but an out-of-sync or missing update package. git rev-parse HEAD will then return a hash that does not match what's actually being served to the devices, and the problem must be corrected manually on the server command line. In other words, the server has no way of knowing that the update package is wrong; it simply assumes, on pure faith, that the package matches the repo. Combined with the previous points, this makes the server extremely fragile in practice.

One of the biggest problems is: There is currently no separate updater daemon running on the device. Due to complications waiting for wifi internet access to come up and some last-minute hackery, it's the main device control software itself that checks for updates and installs them. This means that if a poorly tested version somehow makes it into production and the control software can't start, all existing devices are essentially bricked, as they can no longer update themselves. This would be an absolute nightmare in production. Same deal for a single device if it loses power at an unlucky time.

The other major problem is: There is no support for incremental updates. If a device, say, isn't turned on for a while, then the next time it's updated it skips a bunch of release versions, so it has to be able to do a direct version-skipping update. The consequence is that update deployment is a nightmare of making sure that any given update can be applied on top of any given past version. Furthermore, since git hashes are used to identify versions rather than version numbers, lexicographical comparison of versions to facilitate incremental updates is currently not possible.
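To illustrate the ordering point: a git hash carries no "newer than" relation, whereas version numbers can be ordered and used to work out which releases a device has missed. The sketch below is hypothetical (the MAJOR.MINOR.PATCH scheme and the names parse_version and pending_updates are assumptions, not part of the current system). It also shows that even with version numbers the comparison should be numeric, since plain string comparison sorts "1.10.0" before "1.3.1".

```python
from typing import List, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Turn 'MAJOR.MINOR.PATCH' into a tuple of ints that compares numerically."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def pending_updates(installed: str, releases: List[str]) -> List[str]:
    """Return the releases newer than the installed one, oldest first."""
    current = parse_version(installed)
    newer = [r for r in releases if parse_version(r) > current]
    return sorted(newer, key=parse_version)
```

For example, a device on 1.2.0 facing releases 1.1.9, 1.2.0, 1.3.1 and 1.10.0 would get back 1.3.1 followed by 1.10.0, which is the order in which incremental updates would have to be applied.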

A new requirement that I do not currently support is that there will exist some per-device configuration options (key/value pairs) that must be configured on the administrative server side. I wouldn't mind somehow serving these per-device options back to the device in the same HTTP request as the software update (perhaps I could encapsulate it in HTTP headers/cookies) although I'm not too concerned about this, as I can always make it a separate HTTP request.

There is a slight complication due to the fact that two (and more in the future) versions of the hardware exist. The current version of the hardware is actually stored as an environment variable on its initial SD image (they can't self-identify) and all software is designed to be compatible with all versions of the devices. Firmware updates are chosen based on this environment variable and the update package contains firmware for all versions of the hardware. I can live with this although it is a bit clunky.

There currently exists no way to manually upload an update to the device (long story short these devices have two wifi adapters in them, one to connect to the internet, and one in AP mode that the user uses to configure the device; in the future I intend to add an "update software" function to the device's local web interface). This is not a huge deal, but does have some impact on the update installation method.

A bunch of other frustrations and general unsafeness.

So... that was long. But my question boils down to this:

How do I do this properly and safely? Are there small adjustments I can make to my existing process? Is there a time-tested strategy / existing system I can leverage so that I don't have to roll my own crappy update system? Or if I do have to roll my own, what are the things that must be true in order for a deployment/update process to be safe and successful? I have to also be able to include development devices in the mix.

I hope the question is clear. I realize it's a bit fuzzy, but I am 100% sure that this is a problem that has been tackled before and successfully solved; I just do not know what the current accepted strategies are.
