2013-09-12

A couple of days ago, Randy Bias, the CEO of CloudScaling, released an interesting white paper titled Converged Storage, Wishful Thinking and Reality. In it he discusses distributed storage, where he believes it’s useful, and the perceived shortcomings of distributed storage technologies. The white paper can be distilled down to three main bullet points; we’ll start with the first two and describe how Ceph can adapt to these challenges.

1. Cost – The economics of disk drives, SSDs and tape

2. Technology – The wide variance of technology applications for spinning disks, SSDs and tape

In the white paper we find the statement:

In the case of hybrid storage, SSDs are frequently used to speed up pools of spinning disk drives, optimizing and balancing between cost and latency in a sane manner. It won’t work for all workloads, but it works for many or even most.

So hybrid storage can get us pretty far and is acceptable for many, if not most, workloads. How can we do that with Ceph?

Ceph’s Hybrid Storage and Caching Strategies

Ceph object storage daemons (OSDs) can be created using multiple devices. This is typically done to strike a balance between low-capacity, low-latency, high-throughput storage devices like SSDs and high-capacity, high-latency, mediocre-throughput disk drives. Each Ceph OSD is composed of two volumes, a data volume and a journal; these volumes can be colocated on a single storage device or split between storage devices of differing characteristics:

Hard disk -> Partition -> Filesystem -> Ceph data volume

SSD -> Partition -> Ceph journal volume
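To make that layout concrete, here is a minimal sketch, in Python, of rendering the per-OSD ceph.conf stanza that splits an OSD this way. The osd data, osd journal and osd journal size options are real; the device paths and journal size below are illustrative assumptions, not recommendations.

```python
# Render a ceph.conf stanza that places an OSD's data volume on a
# disk-backed filesystem and its journal on an SSD partition.

def osd_stanza(osd_id, data_path, journal_dev, journal_mb=10240):
    """Build the [osd.N] section for a split-device OSD."""
    lines = [
        "[osd.%d]" % osd_id,
        "    osd data = %s" % data_path,           # filesystem on the hard disk
        "    osd journal = %s" % journal_dev,      # raw partition on the SSD
        "    osd journal size = %d" % journal_mb,  # journal size in MB
    ]
    return "\n".join(lines)

print(osd_stanza(0, "/var/lib/ceph/osd/ceph-0", "/dev/sdb1"))
```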

Multi-device OSD configurations benefit from being able to achieve full write throughput, because writes are committed to the fast journal before being flushed to the data volume. That said, if the media backing your data volume are slow at performing writes, you can accelerate them with a kernel block-layer cache like flashcache or bcache. Reads can be served from a replica on any of the OSDs that compose a placement group, and hot objects can often be served directly from the page cache.

Client-side caching is also an option for increasing read performance. Ceph’s RADOS block device (RBD) has configurable caching. If you have chosen to consume Ceph through the CephFS distributed filesystem, then you have the page cache on the OSDs, the page cache on the clients, and soon the ability to use FS-Cache (currently merging into the Linux mainline). If you’re using Ceph in an Object Storage capacity, similar to S3/Swift, then you can cache at the front end with something like Varnish and purge objects from caches after PUT/POST requests. A content delivery network can also be configured to use a Ceph Object Store as origin storage for geo-replicated caches. Finally, Ceph could safely add caching to the RADOS gateway by caching immutable tail objects. This is possible because an S3/Swift “object” is actually composed internally of a RADOS head object that maps to N RADOS tail objects striped across OSDs, where N depends on the size of the object. That’s a lot of flexibility.
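For RBD, the cache is a client-side librbd feature. Here is a minimal sketch using the Python bindings; the pool and image names are made up, the image is assumed to already exist, and in practice the rbd cache options usually live in the [client] section of ceph.conf rather than being set programmatically.

```python
# Enable RBD client-side caching for this process, then exercise it.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.conf_set('rbd_cache', 'true')                      # writeback cache on
cluster.conf_set('rbd_cache_size', str(64 * 1024 * 1024))  # 64 MB of cache
cluster.connect()

ioctx = cluster.open_ioctx('rbd')        # hypothetical pool name
image = rbd.Image(ioctx, 'test-image')   # hypothetical, pre-existing image
image.write(b'hello', 0)   # absorbed by the client-side cache
print(image.read(0, 5))    # can be served from cache, no OSD round trip
image.close()
ioctx.close()
cluster.shutdown()
```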

Future Tiering

Tiering is a roadmap item for Ceph, slated for the Emperor release. The idea is that you create multiple RADOS pools in a single cluster, where the distinguishing characteristic of each pool is the underlying configuration of its nodes, their network and their storage devices. Each pool has a tunable replication factor, and all replication is intra-pool.
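Multiple pools with different replication factors are already possible today. Here is a sketch using the Python bindings, with illustrative pool names; confining each pool to SSD-only or disk-only OSDs is done with CRUSH rules, which I’ve omitted.

```python
# Create two pools and give each a different replication factor.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

def make_pool(name, replicas):
    """Create a pool if needed and set its replica count via the monitors."""
    if not cluster.pool_exists(name):
        cluster.create_pool(name)
    cmd = json.dumps({'prefix': 'osd pool set', 'pool': name,
                      'var': 'size', 'val': str(replicas)})
    ret, buf, errs = cluster.mon_command(cmd, b'')
    assert ret == 0, errs

make_pool('ssd-cache', 2)   # low-latency tier on SSD-backed OSDs
make_pool('disk-store', 3)  # high-capacity tier on spinning disks
cluster.shutdown()
```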

With two or more storage pools offering different durability, availability and performance characteristics, you will soon be able to do several interesting things. The first interesting feature on the Ceph roadmap is being able to redirect objects from one pool to another. If you decide that some objects need a different replication factor or belong on a different storage tier, you can essentially symlink them to a pool with the required characteristics. The blueprint for object redirects is here.

The second interesting feature on the Ceph roadmap is using one pool as a caching layer on top of another pool. The results are similar to multi-device OSDs, but the advantages are read caching and deployment flexibility, i.e., not having to deploy nodes with a specific SSD/disk ratio or symmetrical network links. The blueprint for cache pool overlays is here.
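The overlay itself is future work, but the behavior it automates can be sketched by hand with librados today. Here is a hypothetical read-through pattern that promotes hot objects from a slow pool into a fast one, reusing the pool names from the previous sketch; the real cache pool feature would do this transparently inside the cluster.

```python
# Read-through caching across two pools, done manually from the client.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
fast = cluster.open_ioctx('ssd-cache')    # cache tier
slow = cluster.open_ioctx('disk-store')   # backing tier

def cached_read(obj):
    """Serve an object from the fast pool, promoting it on a miss."""
    try:
        size = fast.stat(obj)[0]          # stat() returns (size, mtime)
        return fast.read(obj, size)
    except rados.ObjectNotFound:
        size = slow.stat(obj)[0]
        data = slow.read(obj, size)
        fast.write_full(obj, data)        # promote the hot object
        return data
```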

CAP, Failure Domains, and Multi-Site Strategies

Things get really interesting when we take the discussion to a macro level and consider multi-site storage strategies, which is why tackling the white paper’s third point will be fun.

3. CAP – The unremitting truth of the CAP theorem

The paper points to Ceph being a massive (software) failure domain and suggests that implied usage contracts might be violated by converging Block and Object storage within this failure domain. Anything, given a long enough time period, will fail. This is fact. Whole servers, server components, power distribution, cooling, networking equipment, software and even entire datacenters are brittle and will fail at some point in time; we should factor that into the design of our systems.

Ceph was designed from the very beginning to be highly resilient to a wide variety of failure scenarios. I won’t get into the gory details, but those interested can read the original papers on Ceph and its CRUSH algorithm. Ceph is a CP system. This isn’t purely academic: at DreamHost, prior to launching DreamObjects into production, we spent several weeks formulating and executing different “Game Day” scenarios. These simulations are similar to what Amazon and Google do on a regular basis, based on anecdotes from Jesse Robbins and Kripa Krishnan. The scenarios we modeled included switch failures on both the front-end and back-end networks, losing a rack, losing a row, contracting the cluster by a rack and expanding the cluster by a rack, among others. We were extremely impressed with Ceph’s ability to tolerate these scenarios. We’ve even witnessed a datacenter-wide power outage take down a Ceph cluster and watched it cold boot.
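Here is a minimal sketch of the kind of health polling that is useful during such drills. It simply asks the monitors for the cluster health once a second; our actual Game Day tooling was more involved, so treat this as illustrative.

```python
# Watch cluster health while a failure is being injected; during a
# simulated rack loss you expect HEALTH_OK -> HEALTH_WARN (degraded
# placement groups) -> HEALTH_OK once recovery completes.
import json
import time
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

def health():
    """Ask the monitors for the current cluster health string."""
    cmd = json.dumps({'prefix': 'health'})
    ret, buf, errs = cluster.mon_command(cmd, b'')
    return buf.strip()

for _ in range(60):
    print(time.strftime('%H:%M:%S'), health())
    time.sleep(1)

cluster.shutdown()
```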

Speaking of datacenter failures, consider this excerpt from Google’s book The Datacenter as a Computer:

Most commercial datacenters fall somewhere between tiers III and IV, choosing a balance between construction costs and reliability. Real-world datacenter reliability is also strongly influenced by the quality of the organization running the datacenter, not just by the datacenter’s design. Typical availability estimates used in the industry range from 99.7% availability for tier II datacenters to 99.98% and 99.995% for tiers III and IV, respectively.

If you want to achieve better than four nines of availability, or if you want your data to survive a datacenter disaster, then you had better have a second site. A Ceph storage system could be built with two sites, one cluster per site, with converged storage, and still meet the white paper’s implied usage contract regarding the durability of object storage. Each site/cluster would be configured with at least three storage pools: primary object, secondary object and block. Each primary object storage pool would be configured to replicate to the secondary pool at the other site. If there is a Ceph cluster meltdown (for whatever reason: DC, network, software, etc.) then new VMs can be provisioned at the second site, from snapshots stored in the secondary pool at the facility with the healthy Ceph cluster.
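Here is a sketch of that recovery path with the Python bindings: after a meltdown at site A, a new VM disk is provisioned at site B from a snapshot that was replicated into site B’s secondary pool. The conffile, pool, image and snapshot names are all hypothetical.

```python
# Restore a VM disk at the surviving site from a replicated snapshot.
import rados
import rbd

CHUNK = 4 * 1024 * 1024  # copy in 4 MB chunks

cluster = rados.Rados(conffile='/etc/ceph/site-b.conf')
cluster.connect()
src_io = cluster.open_ioctx('secondary')  # replicated snapshots land here
dst_io = cluster.open_ioctx('block')      # live VM disks at site B

src = rbd.Image(src_io, 'vm-disk', snapshot='pre-disaster')
size = src.size()
rbd.RBD().create(dst_io, 'vm-disk-restored', size)
dst = rbd.Image(dst_io, 'vm-disk-restored')

offset = 0
while offset < size:  # naive full copy; a sparse-aware copy would be faster
    data = src.read(offset, min(CHUNK, size - offset))
    dst.write(data, offset)
    offset += len(data)

dst.close()
src.close()
dst_io.close()
src_io.close()
cluster.shutdown()
```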

This seems like a sane strategy to me. Let’s explore how several other public cloud storage systems are designed:

Examining the Amazon S3 documentation, we get the gist of how their system is built:

To increase durability, Amazon S3 synchronously stores your data across multiple facilities before returning SUCCESS. In addition, Amazon S3 calculates checksums on all network traffic to detect corruption of data packets when storing or retrieving data. Unlike traditional systems which can require laborious data verification and manual repair, Amazon S3 performs regular, systematic data integrity checks and is built to be automatically self-healing.

* Backed with the Amazon S3 Service Level Agreement.

* Designed for 99.999999999% durability and 99.99% availability of objects over a given year.

* Designed to sustain the concurrent loss of data in two facilities.

The RRS option stores objects on multiple devices across multiple facilities, providing 400 times the durability of a typical disk drive, but does not replicate objects as many times as standard Amazon S3 storage, and thus is even more cost effective.

* Backed with the Amazon S3 Service Level Agreement.

* Designed to provide 99.99% durability and 99.99% availability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.01% of objects.

* Designed to sustain the loss of data in a single facility.
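Those RRS figures are worth making concrete, since a durability target translates directly into expected annual loss:

```python
# Back-of-the-envelope check of the RRS numbers quoted above: 99.99%
# durability per object-year implies losing 0.01% of objects annually.
objects = 1000000         # hypothetical object count
durability = 0.9999       # RRS design target, per object per year
expected_loss = objects * (1 - durability)
print(expected_loss)      # ~100 objects per year, i.e. 0.01% of them
```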

Amazon is unique; no other Object Storage system provides the same contract. It sounds like they layered a CP system on top of several AP systems.

Now let’s examine OpenStack Swift, the component used by HP Cloud Storage, Rackspace CloudFiles and CloudScaling OCS for Object Storage.

The data then is sent to each storage node where it is placed in the appropriate Partition. A quorum is required — at least two of the three writes must be successful before the client is notified that the upload was successful.
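This is not Swift’s actual code, but the quorum rule quoted above is simple enough to sketch: a write is reported successful once two of the three replica writes succeed. store_replica() below is a hypothetical placeholder for a single replica write.

```python
# A minimal quorum-write sketch: succeed once `quorum` replicas ack.

def quorum_write(nodes, obj, data, store_replica, quorum=2):
    """Return True if at least `quorum` replica writes succeed."""
    acks = 0
    for node in nodes:
        try:
            store_replica(node, obj, data)  # one replica write
            acks += 1
        except IOError:
            continue  # tolerate individual node failures
        if acks >= quorum:
            return True  # the client can be told the upload succeeded
    return False
```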

That doesn’t exactly jibe with the contract of Amazon S3; is that bad? It’s different, that’s for sure. Swift is always an AP system. Next I’ll dig into how Windows Azure Storage works:

Windows Azure provides two different kinds of blobs. The choices are:

Block blobs, each of which can contain up to 200 gigabytes of data. As its name suggests, a block blob is subdivided into some number of blocks. If a failure occurs while transferring a block blob, retransmission can resume with the most recent block rather than sending the entire blob again. Block blobs are a quite general approach to storage, and they’re the most commonly used blob type today.

Page blobs, which can be as large as one terabyte each. Page blobs are designed for random access, and so each one is divided into some number of pages. An application is free to read and write individual pages at random in the blob. In Windows Azure Virtual Machines, for example, VMs you create use page blobs as persistent storage for both OS disks and data disks.

Also see:

To guard against hardware failures and improve availability, every blob is replicated across three computers in a Windows Azure datacenter. Writing to a blob updates all three copies, so later reads won’t see inconsistent results. You can also specify that a blob’s data should be copied to another Windows Azure datacenter in the same region but at least 500 miles away. This copying, called geo-replication, happens within a few minutes of an update to the blob, and it’s useful for disaster recovery.

It’s pretty clear that Microsoft isn’t afraid of converged storage; they also use triplicate replicas in a primary facility with optional asynchronous replication to a secondary facility. That’s a single-site CP system and, optionally, a multi-site layered CP/AP system.

Moving on, we’ll examine Manta, the Object Storage and integrated compute offering from Joyent.

From the perspective of the CAP theorem, the system is strongly consistent. It chooses to be strongly consistent, at the risk of more HTTP 500 errors than an eventually consistent system. This system is engineered to minimize errors in the event of network or system failures, and to recover as quickly as possible, but more errors will occur than in an eventually consistent system. However, you can always read your writes immediately, and the distinction between a HTTP 404 response and a HTTP 500 response is very clear. A 404 response really means your data isn’t there. A 500 response means that it might be, but there is some sort of outage.

By default, the system stores two copies of your object. These two copies are placed in two different datacenters. The system relies on ZFS RAID-Z to store your objects, so your durability is actually greater than two would imply. Your data is erasure encoded across a large number of disks on physically separate machines.

So with Manta you get N+1 replicas, located in N+1 datacenters, with additional erasure coding at the system level. Manta is strictly a CP system. The last public cloud storage offering I’ll examine is Google Cloud Storage, brought to us by the undisputed heavyweight contender in planet-scale computing: Google.

From a consistency standpoint, Google Cloud Storage provides strong read-after-write consistency for all upload (PUT) operations. When you upload a file to Google Cloud Storage, and you receive a success response, the object is immediately available for download (GET) operations. This is true if you are uploading a new file or uploading and overwriting an existing file.

It sounds like they, too, layered a CP system on top of AP systems, but it’s hard to say because their documentation lacks sufficient hints to discern what they are really up to. Finally, I’ll examine another Object Storage system: Riak Cloud Storage.

In Riak CS, large objects are broken into blocks and streamed to the underlying Riak cluster on write, where they are replicated for high availability (3 replicas by default). A manifest for each object is maintained so that blocks can be retrieved from the cluster and the full object presented to clients. For multi-site replication in Riak CS, global information for users, bucket information and manifests are streamed in real-time from a primary implementation to a secondary site so global state is maintained across locations. Objects can then be replicated in either full sync or real-time sync mode.
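These are not Riak CS internals, but the block-plus-manifest idea is easy to sketch: a large object is split into fixed-size blocks, and a manifest records the block keys needed to reassemble it. The block size and key scheme below are illustrative.

```python
# Split a large object into blocks and describe it with a manifest.

BLOCK_SIZE = 1024 * 1024  # 1 MB blocks, an arbitrary illustrative choice

def split_object(key, data):
    """Return (manifest, blocks) for a large object."""
    blocks = {}
    block_keys = []
    for i in range(0, len(data), BLOCK_SIZE):
        bkey = '%s:%d' % (key, i // BLOCK_SIZE)
        blocks[bkey] = data[i:i + BLOCK_SIZE]
        block_keys.append(bkey)
    manifest = {'key': key, 'size': len(data), 'blocks': block_keys}
    return manifest, blocks

def read_object(manifest, fetch):
    """Reassemble an object by fetching each block listed in its manifest."""
    return b''.join(fetch(bkey) for bkey in manifest['blocks'])
```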

Riak CS can be a CP or AP system layered on AP systems. This is probably the closest analog to Amazon’s offering.

In summary, Google Cloud Storage, Microsoft Azure Storage, Joyent’s Manta and Ceph all feature strong consistency and replication to another facility, and for fun you can throw in HDFS. This bears a striking resemblance to RDBMS-based systems composed of many machines; we’re just switching out our Legos for cinderblocks.

Hybrid storage with Ceph is a reality now, and the roadmap for tiering looks rather promising with regard to addressing the concerns raised by the white paper. Using Ceph as a converged storage system isn’t inherently bad; it’s not a myth, it’s an option. Ceph is flexible like that. This flexibility empowers system designers to satisfy a diverse set of storage requirements. I’ll take the white paper’s red and blue pills, swallow them both, and enjoy watching Ceph and other distributed storage systems disrupt the storage market.
