2016-07-27

I have been meaning to write about this for a while, since it is what I have
been up to lately at work. I spent quite some time investigating the state of
instance provisioning on each cloud provider, and I thought I would share my
findings here.

If you are deploying instances (a.k.a. virtual machines, or VMs) to a public
cloud (e.g. AWS, Azure), you might be wondering what your instance goes
through before you can start using it.

This article is going to be about that. I hope you enjoy it. Please let me know
at the end how you liked it!

Table of Contents

What Is Provisioning?

Key Tools for Provisioning

Instance Metadata API

Cloud-Init

Provisioning in Public Cloud

AWS EC2

DigitalOcean

Google Compute Engine

Microsoft Azure

Conclusion

What Is Provisioning?

You have an application and you need to run this without purchasing physical
servers. You go to a cloud provider and you ask for virtual servers to run
your service on. You tell the cloud provider: “I want a Debian 8 Linux machine,
and I want you to add this SSH public key to the machine so that I can log in
and run my app”. You get what you asked for within minutes, if not seconds.

(Very ordinary, right? I am sure this was mind-blowing a decade ago when AWS
EC2 came out. It is certainly still mind-blowing to me.)

All the operations that occur from the moment you request a VM to the moment
you can log in to it are called provisioning.

Most of the provisioning magic happens in the cloud provider’s
proprietary/internal software that manages the physical machines in the
datacenter. A physical node is picked, the VM image you specified is copied to
that machine, and the hypervisor boots up your VM. This is provisioning from
the infrastructure side, and we are not going to talk about it here.

Then the provisioning goes on… Your machine is now up (think of a Debian or
Ubuntu Server image). It has no user accounts and no SSH keys. It is almost
like a vanilla OS image you could download from the internet yourself. You
can’t log in.

This is where user-mode provisioning kicks in. Your machine runs some code
that specializes this generic image. Typical steps include:

creating the OS user you wanted

adding SSH credentials to the machine so that you can log in

running startup scripts you provided (to install or configure stuff)

mounting an ephemeral/scratch disk from the physical host

All this is part of provisioning and once this is all done, you have a VM
prepared for you to log in and use it!

This user-mode provisioning runs only once. When it is all set, it gets out of
your way and lets you run your workloads.

Key Tools for Provisioning

How does a vanilla Linux server image learn about your configuration, such as
your credentials, and set up the virtual machine accordingly? To understand
that, you should know about a couple of tools that play key roles here:

⚒ Instance Metadata API

This is an HTTP API that runs at http://169.254.169.254/ on any VM running in
the cloud. You make calls to it and it gives you information about your VM,
such
as:
- what is the name of the VM
- what is the instance size
- what region is your VM in
- what are the SSH public keys assigned to the VM
- what is the startup script that the VM should execute

It is a simple, text-based API. You just make the request and you get what you want:
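For example, here is a minimal sketch in Python. The link-local address is the
same on most providers, but the path layout shown below is EC2-style; other
providers lay out their metadata differently.

```python
import urllib.request

# Link-local metadata address used by most cloud providers;
# the path layout below follows EC2's conventions.
METADATA_BASE = "http://169.254.169.254/latest/meta-data/"

def get_metadata(path, timeout=2):
    """Fetch a single metadata key as plain text."""
    with urllib.request.urlopen(METADATA_BASE + path, timeout=timeout) as resp:
        return resp.read().decode()

# Inside a VM you could then query, for instance:
#   get_metadata("instance-id")
#   get_metadata("public-keys/0/openssh-key")
```

Note that this only works from inside the VM itself; the address is
link-local and not reachable from the outside.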

Instance Metadata is provided to the VM by the hypervisor or other underlying
infrastructure of the cloud provider. This is how your VM knows about itself and
what it should do.

In case you are interested, I have a comprehensive blog post about comparison
of instance metadata APIs across public cloud providers on my blog.

⚒ Cloud-Init

cloud-init is a Linux tool that runs when your instance boots to handle
provisioning from within the instance. It sets up your virtual machine by
configuring networking and the hostname, placing your SSH credentials and,
optionally, running startup scripts you provided.

cloud-init detects which cloud provider you are running on (using certain
heuristics on the filesystem or the metadata API protocol) and then figures
out which data source class to use.

Then it calls the data source class to get the data it needs (such as SSH
keys) and provisions your instance with it. Basically, the cloud-init package
is what makes a cloud server image different from a vanilla distro image.
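To make the startup-script part concrete, here is a small example of the sort
of user data you can hand to cloud-init, written in its #cloud-config YAML
format (the user name, key, and package list are made up for illustration):

```yaml
#cloud-config
# Create an OS user and authorize an SSH key (values are illustrative).
users:
  - name: demo
    ssh_authorized_keys:
      - ssh-rsa AAAA... user@example.com

# Install packages and run a command on first boot.
packages:
  - nginx
runcmd:
  - systemctl enable --now nginx
```

cloud-init picks this document up from the metadata API (or another data
source) on first boot and executes each module exactly once.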

cloud-init is written in Python and was originally developed by Canonical for
Ubuntu Server. Over time it gained popularity, got love from other Linux
distro vendors as well as cloud providers, and became a widely adopted package
baked into many distro images on the cloud.

Provisioning in Public Cloud

In this section I am going to explain how each cloud provider provisions its
instances in user space. Some of them are similar, although some have
differences interesting enough to point out.

☁︎ Amazon Web Services EC2

Obviously AWS started all this, since it was the first IaaS provider, but the
notion of provisioning is older than that: people used to run virtualization
software in their on-premise datacenters and servers.

The way EC2 provisions instances is plain and simple:

Most images on EC2 (AMIs) have cloud-init baked into the image.

cloud-init queries EC2 Instance Metadata API and gathers data to
provision the instance.

The whole AWS implementation in cloud-init is only 200 lines of code. I
think having cloud-init everywhere gives EC2 a clean and unified way of
provisioning Linux instances.

☁ DigitalOcean

I love DigitalOcean and use it personally. When it comes to provisioning, they
follow the AWS EC2 approach:

have the cloud-init package baked into all images

use the metadata API to get data about the instance.

Since DigitalOcean has only a few images available (at least today),
provisioning is not very exciting here either. The cloud-init implementation
of DigitalOcean is also very small, just 110 lines of code.

The only difference I spotted is that DigitalOcean has a way of reordering the
steps executed by cloud-init (listed in cloud.cfg) via a cloud-init feature
called vendor-data. This data also comes from the metadata API, and as far as
I can tell nobody except DigitalOcean uses this feature in the cloud-init
codebase. They use it for things like keeping the root user enabled and
managing /etc/hosts via cloud-init. (In case you want to dig deep, here is the
vendor-data DigitalOcean presents and its diff with cloud.cfg.)

☁ Google Compute Engine

Google does not use cloud-init (BOOM!). I do not know why but what they came
up with instead is remarkably cool:

Google wrote their own instance guest agents in Python. These are installed on
all stock images on GCE and are open source on GitHub. And I said agents,
meaning not a single monolithic agent but a bunch of small services. You can
find a list of these on your GCE instance:

Self-documenting enough… The one I really want to talk about is
google-accounts-daemon, the one that creates the user accounts and places the
SSH keys you provided when creating your VM.

Those who are GCE customers will know these two fantastic features: if they
ever lose their credentials, they can drop new SSH keys onto the VM from the
gcloud CLI or the web console at any time and restore access.

Or another killer feature: GCE has an “SSH” button on the web interface and
the gcloud compute ssh <vm> command to give you on-the-fly SSH access to the
instance by creating short-lived SSH keys and dropping them onto the instance
within 10 seconds.

What is this magic? How is this possible and so fast? Well, first you need to
know this: the Google Metadata API supports long polling. This means you start
a long-standing HTTP request to the metadata API, and if the key you are
watching changes, the server returns a response with the new value. Then
google-accounts-daemon parses the instance/attributes/ssh-keys value in the
response (which contains the new SSH keys) and creates Linux users and adds
SSH keys accordingly. This is how the magic works.
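A rough sketch of that watch-and-apply loop in Python, assuming the
wait_for_change long-polling parameter of the GCE metadata API (the parsing
here is simplified; the real daemon also handles etags, retries, key expiry
and actually creates the accounts):

```python
import urllib.request

# GCE metadata endpoint for the watched key; wait_for_change=true turns
# the request into a long poll that returns when the value changes.
SSH_KEYS_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/attributes/ssh-keys?wait_for_change=true")

def watch_ssh_keys():
    """Block until the ssh-keys metadata value changes, then return it."""
    req = urllib.request.Request(SSH_KEYS_URL,
                                 headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req) as resp:  # hangs until a change
        return resp.read().decode()

def parse_ssh_keys(value):
    """Each line looks like 'username:ssh-rsa AAAA... comment'."""
    keys = {}
    for line in value.splitlines():
        user, _, key = line.partition(":")
        keys.setdefault(user, []).append(key)
    return keys  # e.g. {"alice": ["ssh-rsa AAAA... alice@host"]}
```

The daemon would then create any missing Linux users from this mapping and
rewrite their authorized_keys files.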

If you ever lose your SSH key on EC2 or DigitalOcean, you are doomed. But GCE
has this (and Azure has something similar). So this is pretty cool.

I said Google does not use cloud-init, but if you bring your own custom VM image
(that has cloud-init package in it) it will provision just fine as there
is a GCE implementation in cloud-init. It is short (160 lines of code) and
just queries GCE Metadata API to get all the data it needs to provision. If you
go down this route, you won’t be getting all these cool features.

☁︎ Microsoft Azure

(Before I begin, a quick disclaimer to save my butt: I work on the Microsoft
Azure Linux team and this is precisely the area I work on. These are my
personal opinions, and it goes without saying that I have tried to write this
section as objectively as I can.)

Azure started as a Windows PaaS provider in 2010 and stayed that way until
2013, when IaaS was made generally available. When IaaS launched, Azure did
not have many Linux images; however, it was picking up. (read: Microsoft is
now doing Linux, can ya believe it?!! and I was there!!1) However, since most
of the infrastructure and APIs were designed for Windows, provisioning Linux
instances on Azure is a bit unconventional and non-trivial.

As of this writing, Azure does not have an instance metadata service. It has
an undocumented HTTP API which is internally called the “Wire Server”. One of
the Red Hat engineers kindly documented it here. It is XML-based (compared to
other metadata servers, which are JSON/text based) and a bit cryptic at first.

However, Azure does not use the Wire Server for provisioning. So where does
the provisioning data come from? When a new virtual machine boots for the
first time, Hyper-V (the hypervisor of Azure datacenters) attaches a DVD-ROM
device to the instance. The provisioning code then mounts the device and reads
a file called ovf-env.xml from it. This file contains username, SSH key and/or
password data (yes, Azure allows creating Linux VMs with passwords) and is
used to provision the instance.
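In spirit, the provisioning code does something like the following sketch. The
mount point and the element names here are simplified for illustration; the
real ovf-env.xml is namespaced and considerably richer than this.

```python
import xml.etree.ElementTree as ET

def read_provisioning_config(path="/mnt/cdrom/ovf-env.xml"):
    """Pull basic provisioning values out of the attached DVD's XML file.

    Illustrative sketch only: the mount path and element names are
    assumptions, not the exact ones Azure uses.
    """
    root = ET.parse(path).getroot()

    def text(tag):
        # "{*}" matches the element in any (or no) XML namespace.
        el = root.find(".//{*}" + tag)
        return el.text if el is not None else None

    return {
        "username": text("UserName"),
        "hostname": text("HostName"),
    }
```

The agent would then create that user, set the hostname, and place the SSH
key material also carried in the file.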

So who provisions the instance, then? Well, it depends. First of all, Azure
has its own guest agent, called waagent, running on all Linux instances. It is
written in Python and open source on GitHub.

On most Linux images on Azure, waagent is the provisioning tool. However, for
images like Debian and Ubuntu Server, cloud-init does the provisioning and
waagent is still there. This is mostly because the Azure guest agent does a
lot more than provisioning. It has quite a few tasks, some important ones
being:

formatting the ephemeral disk to ext4 (Hyper-V gives it as NTFS)

processing virtual machine extensions

enabling RDMA (remote direct memory access) for HPC (high performance
computing) workloads

retrieving and placing instance certificates and private keys by converting
them from PKCS#12 format to PEM (PKCS#12 works well with Windows, however it
is not conventional in POSIX environments.)

In cases where cloud-init and waagent are both present, they coordinate and do
not step on each other (although these parts are a bit hacky). Finally, the
agent has to send a “provisioned” signal to the Wire Server, otherwise Azure
thinks that provisioning is still going on and your VM will not be listed as
started on the API or in the management portal.

Some of the extra features of waagent, such as VM extensions, offer
flexibility to users, letting them install application bundles called
extensions to their VMs via the CLI or REST APIs. Extensions can do things
like restore access, run arbitrary scripts, or install anti-malware software
on the machine without you even having to SSH in.

As I said before, provisioning on Azure is a bit non-trivial. You can see this
in the cloud-init Azure data source implementation, which is about 650 lines
long (plus another 280 lines of utility methods). This is mostly because Azure
has no text-based metadata service, delivers pieces of information in an
unconventional way, and requires the “provisioned” signal to be sent via a
complicated protocol.

Conclusion

Now you know how each cloud provider brings up your virtual machine.
cloud-init is huge and plays a key role in instance provisioning across cloud
providers. Instance metadata APIs are the other key piece that makes
provisioning possible.

Most people have no reason to care about or even know about any of this, as
long as the cloud providers are doing their job right. But I hope you now have
more visibility into what your instances on the cloud go through before you
get to use them.

If you have read this far please let me know in the comments about what you think!
