Today John Dickinson posted a message on the OpenStack mailing list with links to the design for erasure codes (EC for short) in Swift. This design has been built in collaboration with Intel and Box, with input from other users of Swift. This post will go a bit deeper into the design philosophy of OpenStack Swift and how EC will fit in.
Erasure codes will enable a cluster to reduce the overall storage footprint while maintaining a high degree of durability. This is a huge benefit for some use cases for Swift. But that’s not the whole story! Why didn’t Swift start with EC in the first place?
There is no free lunch
Saving space seems to be a no-brainer. Store more with less? Of course! But, alas, saving space has a cost as well: we live in a world of tradeoffs.
The CAP Theorem says that when designing a distributed system, there are Consistency, Availability and Partition Tolerance: pick any two. Consistency means that all parts of the system see the same data at the same time. Availability is a guarantee that every request receives a response about whether it succeeded or failed. And partition tolerance means the system continues to operate despite arbitrary message loss or failure of part of the system.
Swift chooses Availability and Partition Tolerance over Consistency.
What this means is that when there is a partition (like servers not being able to talk to each other), Swift is tolerant of this partition and will still service the request. Because Swift replicates data across failure domains, any segment of the cluster will be able to serve the object and the system remains highly available.
Availability in the face of partitions is the huge benefit of using replicas. But there are also a couple of other benefits:
Latency and Reconstruction
For many of our customers, concurrency and latency rule the day. In a replica model, a read request can be served from a single storage volume. This means less network overhead and less CPU effort, which results in lower latency for the user.
Failure handling is fairly simple in a replicated system, too. When a drive fails and is replaced, any other remaining replica can copy the data directly to the new replacement drive. This simple copy-only operation has low CPU requirements and doesn’t need to coordinate connections with many servers in the cluster.
Replicas and Regions
In Swift’s current replication model, all data is fully copied multiple times throughout the cluster. At SwiftStack we recommend 3 replicas for single-site Swift clusters and 2 x 2 replicas for two globally distributed sites. This strategy gives the cluster fantastic durability and availability, since whole copies of data are distributed across the failure domains and requests can be served from local replicas instead of being reassembled from bits across the WAN.
More than one way to slice an app
We love the replica model. It’s operationally simple, low-latency, and highly available. This pairs up nicely with the use cases of many production Swift clusters. Users expect the data to be available instantly, on any device and stored forever!
However! Why not provide replicas where there is an advantage AND erasure codes to save space when the demands on the data are less intensive?
Erasure Codes with Swift
To provide a smaller storage footprint, we will build on the existing components in Swift to provide a highly-durable storage system using erasure codes. The design goal is to be able to have erasure-coded storage in addition to replicas in Swift. This will allow a choice in how to store data and make the right tradeoffs based on the use case.
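To put a rough number on "smaller storage footprint", here is a back-of-the-envelope sketch. The 10 data + 4 parity layout is purely an assumption chosen for illustration; this post does not specify the parameters Swift will use, and the function names are hypothetical.

```python
def replica_overhead(replicas: int) -> float:
    """Raw bytes stored per logical byte when keeping full copies."""
    return float(replicas)


def ec_overhead(data_chunks: int, parity_chunks: int) -> float:
    """Raw bytes stored per logical byte for an erasure-coded object.

    The object is split into `data_chunks` pieces and `parity_chunks`
    extra pieces are computed from them, so the raw size grows by
    (data + parity) / data.
    """
    return (data_chunks + parity_chunks) / data_chunks


def raw_terabytes(logical_tb: float, overhead: float) -> float:
    """Raw capacity needed to hold `logical_tb` of user data."""
    return logical_tb * overhead


if __name__ == "__main__":
    logical_tb = 100.0
    # 3 full replicas, as recommended above for a single site.
    print("replicas:", raw_terabytes(logical_tb, replica_overhead(3)), "TB raw")
    # Assumed 10+4 layout, for illustration only.
    print("EC 10+4 :", raw_terabytes(logical_tb, ec_overhead(10, 4)), "TB raw")
```

Under these assumed numbers the erasure-coded layout needs well under half the raw capacity of three replicas, which is the space saving being traded against the latency and CPU costs discussed in this post.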
Swift currently uses a data structure called the Ring to distribute a partition space across the cluster. This partition space is core to the replication system in Swift. It allows Swift to quickly and easily synchronize each partition across the cluster. When any component in Swift needs to interact with data, a quick lookup is done locally in the Ring to determine the possible partitions for each replica.
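A toy sketch of this kind of local lookup: hash the object's name, take the top bits of the hash as the partition number, then read the devices for each replica out of a precomputed table. Everything here (the constants, the device names, the round-robin table) is a simplifying assumption for illustration and not Swift's actual Ring code, which also has to spread replicas across failure domains.

```python
import hashlib
import struct
from typing import List, Tuple

PART_POWER = 4   # assumed: 2**4 = 16 partitions
REPLICAS = 3     # assumed replica count
DEVICES = ["dev0", "dev1", "dev2", "dev3", "dev4", "dev5"]  # hypothetical


def build_table() -> List[List[int]]:
    """table[replica][partition] -> index into DEVICES.

    A naive round-robin assignment; it only guarantees that the replicas
    of one partition land on different devices.
    """
    parts = 2 ** PART_POWER
    return [[(part + replica) % len(DEVICES) for part in range(parts)]
            for replica in range(REPLICAS)]


REPLICA2PART2DEV = build_table()


def get_partition(account: str, container: str, obj: str) -> int:
    """Map an object name to a partition using the top bits of its hash."""
    path = "/{}/{}/{}".format(account, container, obj).encode("utf-8")
    digest = hashlib.md5(path).digest()
    top32 = struct.unpack_from(">I", digest)[0]
    return top32 >> (32 - PART_POWER)


def get_nodes(account: str, container: str, obj: str) -> Tuple[int, List[str]]:
    """Return the partition and the device holding each replica of it."""
    part = get_partition(account, container, obj)
    devices = [DEVICES[REPLICA2PART2DEV[r][part]] for r in range(REPLICAS)]
    return part, devices


if __name__ == "__main__":
    print(get_nodes("AUTH_demo", "photos", "cat.jpg"))
```

Because the lookup is a hash plus a table read, any component holding a copy of the table can answer it without talking to another server.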
Swift already has three rings to store different types of data. There is one for account information (Swift, after all, is a multi-tenant storage system), another for containers (so that it’s convenient to organize objects under an account) and another for the object replicas. To support erasure codes, an additional ring will be created to store erasure code chunks.
In a Swift cluster the proxy servers stream requests to the appropriate storage nodes. When a request is made to store an object that is erasure coded, the proxy tier will encode the object into chunks and stream those chunks to appropriate partitions.
Reading the Data
When a request is made to read an object that is erasure coded, the proxy tier will make the appropriate connections to the storage servers and decode the object data as it is sent to the client. One advantage of erasure coding is that a drive failure during a read can still result in valid data being sent to the client if there are enough other chunks to reconstruct the missing part.
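To make the encode-on-write, decode-on-read idea concrete, here is a minimal sketch using the simplest possible erasure code: split the object into k data chunks and add one XOR parity chunk. This is an assumption chosen for brevity, not the code Swift will use; it survives the loss of only one chunk, whereas the codes used in practice tolerate several. The function names are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split `data` into k equal-size chunks and append one parity chunk."""
    chunk_len = -(-len(data) // k)                 # ceiling division
    padded = data.ljust(chunk_len * k, b"\0")      # zero-pad the last chunk
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]


def decode(chunks: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the object; at most one chunk may be missing (None)."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one missing chunk")
    repaired = list(chunks)
    if missing:
        present = [c for c in chunks if c is not None]
        # The XOR of all k+1 chunks is zero, so the XOR of the survivors
        # equals the missing chunk.
        repaired[missing[0]] = reduce(xor_bytes, present)
    return b"".join(repaired[:-1])[:original_len]


if __name__ == "__main__":
    obj = b"hello erasure coded world"
    stored = encode(obj, k=4)
    stored[2] = None                               # simulate a failed drive
    print(decode(stored, len(obj)) == obj)
```

The read path here mirrors the drive-failure case described above: as long as enough chunks survive, the client still receives the complete object.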
Building on tested components
There are a few advantages to doing the encoding in the proxy tier. First, it enables us to stand on the shoulders of the existing, battle-hardened mechanisms that Swift already has. Second, it concentrates the CPU load in the existing CPU-heavy proxy servers. In most real-world deployments, storage capacity grows faster than concurrent access requirements, so keeping the erasure coding machinery in the proxy server allows deployers to keep costs low.
We can use the existing concept of handoff locations in Swift to store parts when (not if) there are physical hardware or connectivity failures. Each node in the system knows how to dig deeper for chunks if they are not found in their primary location.
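The "dig deeper" behavior can be pictured as an ordered search: try the primary locations first, then fall back to the handoff locations. In this sketch each storage node is modeled as a plain dictionary, which is an assumption for illustration only; `find_chunk` is a hypothetical helper, not a Swift API.

```python
from typing import Dict, Iterable, Optional

# Hypothetical stand-in for a storage node: chunk name -> chunk bytes.
Node = Dict[str, bytes]


def find_chunk(name: str,
               primaries: Iterable[Node],
               handoffs: Iterable[Node]) -> Optional[bytes]:
    """Look for a chunk on primary nodes first, then on handoff nodes."""
    for node in list(primaries) + list(handoffs):
        chunk = node.get(name)
        if chunk is not None:
            return chunk
    return None  # not found anywhere we were told to look


if __name__ == "__main__":
    primary: Node = {}                          # primary lost the chunk
    handoff: Node = {"obj1/chunk3": b"\x01\x02"}
    print(find_chunk("obj1/chunk3", [primary], [handoff]))
```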
The replicator is a continuously running process that ensures each partition contains what it is supposed to contain. A hash file is created in each partition, which is a ‘hash of hashes’ of every file in the partition. That way, replica contents can be quickly compared with a ‘check your twins’ model.
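A minimal sketch of the ‘hash of hashes’ comparison, assuming a partition can be modeled as a mapping from file name to file contents. The choice of MD5, the helper names, and the in-memory model are illustrative assumptions rather than Swift's actual on-disk format.

```python
import hashlib
from typing import Dict


def file_hash(content: bytes) -> str:
    """Hash of a single file's contents."""
    return hashlib.md5(content).hexdigest()


def partition_hash(files: Dict[str, bytes]) -> str:
    """One digest summarizing every file in the partition.

    File names are visited in sorted order so that two partitions with
    the same files always produce the same digest.
    """
    summary = hashlib.md5()
    for name in sorted(files):
        summary.update(name.encode("utf-8"))
        summary.update(file_hash(files[name]).encode("ascii"))
    return summary.hexdigest()


def in_sync(mine: Dict[str, bytes], twin: Dict[str, bytes]) -> bool:
    """Compare two replicas of a partition by exchanging one short hash."""
    return partition_hash(mine) == partition_hash(twin)


if __name__ == "__main__":
    a = {"obj1.data": b"alpha", "obj2.data": b"beta"}
    b = {"obj2.data": b"beta", "obj1.data": b"alpha"}
    c = {"obj1.data": b"alpha"}                    # missing a file
    print(in_sync(a, b), in_sync(a, c))
```

The point of the design is that two servers only need to exchange a single short digest to learn whether a whole partition matches, and only dig into individual files when it does not.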
With erasure codes, we can use the same strategy to ‘check your siblings’. The replicator equivalent for erasure codes, the “reconstructor”, can check neighboring chunks quickly to ensure that individual chunks are available. When there is a failure, the reconstructor can rebuild and push the chunks to the locations where they are missing.
Containers for control granularity
Containers are used in Swift much as buckets are used in AWS S3. They provide a good handle that allows an account to set attributes on a collection of objects, for example access control lists (ACLs). That makes containers a natural place to set the storage policy to either replication or erasure coding.
Use different equipment for EC Tier
The advantage of using an additional ring is to be able to store erasure coded data on different physical hardware if desired. For example, if the data is used for active archive purposes, then it may make sense to use slower, more dense hardware to further drive down costs.
The tradeoffs are pretty simple. Erasure coded objects are stored across many more servers than replicated data. This means that each read and write must establish more network connections to the storage nodes. These additional connections increase the complexity of the failure handling requirements and add latency to each request. Also, since the data must be encoded on write and decoded on read, there is CPU overhead introduced by erasure codes.
While there are tradeoffs, erasure codes in Swift will provide an option for storage savings without sacrificing the overall durability of the system. Now, deployers of Swift will be able to choose between EC and replicas based on which is best for their use case.
* For more information on Erasure Codes, check out these great posts: Behind the Scenes: Erasure Codes – Huge Community Effort, The Foundations of Erasure Codes and Behind the Scenes: “Under the Hood” with Erasure Codes