For decades, the industry has been conditioned to believe that the loss of a drive or the loss of a server was the most tragic event that could happen in our data centers. As the industry started to use “servers” to provide Software-Defined Storage solutions, those concerns about system failure became even greater, since these standard servers were not exclusively designed from the ground up for storage availability. All of us know the traditional challenges of RAID and the fact that RAID sets can’t afford multiple drive failures without the risk of data loss.
Today, a good number of distributed storage solutions are available that alleviate much of this concern, and SwiftStack is one of them. Distributed storage systems provide object and/or file services to applications and end users, but change the way that data is placed by distributing it across the available resources in a cluster of nodes, instead of just the drives inside a single box. This broader data distribution provides much higher levels of data availability and durability. It also means that when a resource in the system (like a drive) fails, it is not as critical an event as it is with a traditional RAID-based system.
Failure Domains in a Distributed Storage System
To understand why drive failure is less of a critical event in a distributed storage system, it is important to understand the failure domains in this modern architecture. A data placement algorithm is used when writing data to the drives in the cluster, and the algorithm is aware of the failure domains (Regions, Zones, Nodes, and Drives). Data is placed across these domains so that the loss of a drive, node, zone, or even region does not result in a loss of data:
- Region is most commonly defined by a physical location, perhaps by city or state or even continent.
- Zone is traditionally used to designate a rack or row in a datacenter within the Region.
- Node is a server in the Zone.
- Drive is an individual device in the Node.
When the algorithm sufficiently distributes data across these domains, failure in the cluster can occur without affecting data availability or causing a critical event.
For example, if a customer has a cluster distributed across 3 regions and protects their data by replicating it 3 times across the cluster, then 1 replica will always be in each region. This provides a region-level failure domain and ensures data availability in the event of a complete data center outage. If the same customer has only 1 region but 3 zones, then 1 replica will be placed in each rack in the data center, providing a rack-level failure domain. If the same customer has 1 region and 1 zone with 4 nodes, then the 3 replicas will be placed on 3 different nodes out of the 4, providing a node-level failure domain. The intent is to place data in “as unique as possible” locations to minimize the impact of any component, server, rack, or data center failure.
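The three replication cases above can be sketched in a few lines of Python. This is a simplified, greedy illustration and not SwiftStack's actual placement algorithm: the `Drive` type and the `build_cluster` and `place_replicas` helpers are hypothetical names, and ties are broken by list order rather than randomly.

```python
from collections import Counter
from typing import NamedTuple


class Drive(NamedTuple):
    region: str
    zone: str
    node: str
    drive: str


def build_cluster(regions: int, zones: int, nodes: int, drives: int) -> list[Drive]:
    """Build a toy cluster with the same shape in every region, zone, and node."""
    return [
        Drive(f"r{r}", f"r{r}z{z}", f"r{r}z{z}n{n}", f"r{r}z{z}n{n}d{d}")
        for r in range(regions)
        for z in range(zones)
        for n in range(nodes)
        for d in range(drives)
    ]


def place_replicas(cluster: list[Drive], replicas: int) -> list[Drive]:
    """Greedy 'as unique as possible' placement (illustrative assumption).

    Each replica goes to the unused drive whose region, then zone, then
    node has been used the fewest times so far.
    """
    if replicas > len(cluster):
        raise ValueError("not enough drives for the requested replica count")
    used: Counter = Counter()  # how often each region/zone/node was chosen
    chosen: list[Drive] = []
    for _ in range(replicas):
        best = min(
            (d for d in cluster if d not in chosen),
            key=lambda d: (used[d.region], used[d.zone], used[d.node]),
        )
        chosen.append(best)
        used.update([best.region, best.zone, best.node])
    return chosen


# (regions, zones per region, nodes per zone, drives per node)
for shape in [(3, 1, 1, 2), (1, 3, 2, 2), (1, 1, 4, 2)]:
    placed = place_replicas(build_cluster(*shape), replicas=3)
    print(shape, sorted(d.node for d in placed))
```

Comparing the widest failure domain first is what pushes the replicas into separate regions when regions exist, then separate racks, then separate servers.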
Understanding Drive Failure at Scale
If we take the above example and scale it up to multiple geographically distributed regions, using erasure coding to protect the data, we can see how the loss of multiple nodes, or even an entire region, will not impact data availability.
Assume a customer has 50PB of usable capacity across two regions, where a whole copy of the data is available in each region and protected using 15+3 erasure coding. Depending on the density of the servers used, a likely design would place 48 nodes (servers) in each region, installed in 6 zones (racks) of 8 servers each, with 90 drives per server.
With this dual-site design, a 15+3 erasure coding policy is used in each region to protect the data, and a whole copy of the data is replicated between regions. This means that a full region loss does not impact data durability, and it also allows all data retrieval to come from the local nodes rather than reaching across the WAN to retrieve an object.
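As a rough check on what a 15+3 policy costs in raw capacity, the arithmetic can be written out as below. The function name `ec_raw_factor` is made up for illustration, and the sketch ignores filesystem and metadata overhead.

```python
def ec_raw_factor(data_fragments: int, parity_fragments: int) -> float:
    """Raw bytes stored per usable byte under a data+parity erasure code."""
    return (data_fragments + parity_fragments) / data_fragments


per_region = ec_raw_factor(15, 3)   # 18 fragments stored for every 15 of data
both_regions = 2 * per_region       # a whole erasure-coded copy in each region

print(f"raw/usable per region:       {per_region:.2f}")    # 1.20
print(f"raw/usable across 2 regions: {both_regions:.2f}")  # 2.40
```

In other words, each region stores 1.2 bytes of raw capacity per byte of data, and up to 3 of an object's 18 local fragments can be missing before a local read fails.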
Using a “unique-as-possible” algorithm to place data, erasure coded fragments will be distributed equally across all zones within each region. Each rack would hold 3 of an object’s 18 fragments, which are then spread across the physical servers in that rack. Keep in mind that each rack has 8 servers with 90 drives each, so a total of 720 drives are available to hold those 3 fragments. If a drive fails, the objects on that disk are reconstructed across the other drives in that zone. If a full server fails, data is read from the drives in other servers. SwiftStack does not rebuild the missing fragments until the customer triggers that recovery, as we don’t want to start rebuilding 1.2PB of data unless there is a confirmed server failure.
With this dual-region, 15+3 erasure coding policy, you end up with 36 fragments written across both regions (18 fragments in each region). 15 fragments are needed to assemble and serve an object successfully from the cluster. Because this is a dual-region scenario, the cluster can reach across the WAN to find any fragments that are not available within the local region.
Let’s take all this information and apply it to some real-world failure scenarios…
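The fragment counts in this design can be verified with a short script. The round-robin `spread_fragments` helper is a hypothetical stand-in for the real placement algorithm. It only demonstrates that 18 fragments fit into 6 racks at 3 per rack with no server holding more than one.

```python
from collections import Counter

DATA_FRAGMENTS = 15
PARITY_FRAGMENTS = 3
REGIONS = 2
RACKS_PER_REGION = 6
SERVERS_PER_RACK = 8
DRIVES_PER_SERVER = 90


def spread_fragments(fragments: int, racks: int, servers_per_rack: int) -> list[tuple[int, int]]:
    """Assign each fragment a (rack, server) slot, cycling through racks first."""
    if fragments > racks * servers_per_rack:
        raise ValueError("more fragments than servers; cannot keep one per server")
    return [(i % racks, (i // racks) % servers_per_rack) for i in range(fragments)]


fragments_per_region = DATA_FRAGMENTS + PARITY_FRAGMENTS
layout = spread_fragments(fragments_per_region, RACKS_PER_REGION, SERVERS_PER_RACK)
per_rack = Counter(rack for rack, _ in layout)
per_server = Counter(layout)

print("fragments per region:", fragments_per_region)                     # 18
print("fragments per rack:", sorted(set(per_rack.values())))             # [3]
print("max fragments on one server:", max(per_server.values()))          # 1
print("drives per rack:", SERVERS_PER_RACK * DRIVES_PER_SERVER)          # 720
print("fragments across both regions:", REGIONS * fragments_per_region)  # 36
```

A real placement algorithm would also vary which servers and drives are chosen from object to object, so that load is spread over all 720 drives in a rack instead of the same three servers.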
Single Drive Failure
If a customer loses a single drive (0.0236% of region capacity, 0.0118% of cluster capacity), the system will start to rebuild the fragments from that drive across the remaining 4,229 drives within the region. If a read request comes in for an object that had a fragment on that disk, the system will find the 15 fragments it needs among those 4,229 remaining drives. Based on the data placement algorithm, those fragments will be split across racks and then across servers.
Full Server Failure
If a customer loses a full server (2.13% of region capacity, 1.065% of cluster capacity), the system will wait for manual acknowledgement before rebuilding all the fragments on those disks. This leaves 4,140 drives to provide the fragments needed for a read request. Note again that the data placement algorithm will not allow more than one fragment of an object on a single server, so a full server loss makes at most one fragment of any object unavailable.
In this scenario, losing a node results in 90x the raw capacity loss (for this server configuration) compared to a single drive, but if the cluster has enough available free capacity, this failure also does not produce a critical event.
Full Rack Failure
Assume that a full rack of 8 servers is down (17.04% of region capacity, 8.505% of cluster capacity). As discussed, the algorithm will have placed three fragments of each object in each rack, so losing one rack leaves 15 of the 18 local fragments. That is exactly the 15 fragments needed to successfully read an object.
For drive and server failure in clusters of this scale, it is common to rebuild and rebalance the cluster before replacing hardware, but when an entire rack fails, it is best to evaluate repairing the cause of the rack failure first before rebuilding this large amount of data. This type of failure is not critical from a data durability standpoint, but should be acted on as soon as possible.
Multiple Rack Failure
Now consider multiple racks failing at the same time. As we have stated, each rack holds 3 fragments of an object. If we lost two racks, we would be down 6 fragments, leaving 12 in the local region, which is below the 15 fragments needed for a valid read. In this case, the cluster will reach across the WAN and use fragments from the 18 stored in the other region to provide a successful read.
This scenario does classify as a critical event because, for a subset of the data, traffic needs to go across the WAN to serve the data to its applications and users. This can impact service levels because of the likely reduction in storage performance.
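The four scenarios above reduce to simple fragment arithmetic, sketched below. The `read_path` function is a hypothetical helper that models the worst case, where every failed component held a fragment of the object being read and the failed servers and drives sit outside any failed rack.

```python
DATA_FRAGMENTS = 15
PARITY_FRAGMENTS = 3
FRAGMENTS_PER_RACK = 3  # 18 fragments spread evenly over 6 racks


def read_path(
    failed_racks: int = 0,
    failed_servers: int = 0,
    failed_drives: int = 0,
    remote_region_up: bool = True,
) -> tuple[int, str]:
    """Return (local fragments still readable, how the read is served).

    Assumes at most one fragment per server and per drive, and that each
    failed component held a fragment of the object (worst case).
    """
    total = DATA_FRAGMENTS + PARITY_FRAGMENTS
    lost = failed_racks * FRAGMENTS_PER_RACK + failed_servers + failed_drives
    local = max(total - lost, 0)
    if local >= DATA_FRAGMENTS:
        return local, "local read"
    if remote_region_up:
        return local, "read via WAN"  # the other region holds a full copy
    return local, "unavailable"


scenarios = {
    "single drive": dict(failed_drives=1),
    "full server": dict(failed_servers=1),
    "full rack": dict(failed_racks=1),
    "two racks": dict(failed_racks=2),
}
for name, failure in scenarios.items():
    local, path = read_path(**failure)
    print(f"{name:>12}: {local} local fragments -> {path}")
```

Only the two-rack case drops below 15 local fragments, which is why it is the one scenario here that sends reads across the WAN.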
To boil it down…
Modern storage platforms like SwiftStack provide an intelligent data placement and durability algorithm that distributes data across available resources to provide very high levels of data durability and availability. As you can see above, the loss of a drive, a server, or even a rack has little to no impact on the operation of a cluster. With the urgency removed from hardware failures, such as the loss of a boot disk, customers are able to leverage low-cost commodity servers to provide a storage architecture that has more durability than a traditional, high-cost, RAID-powered NAS or SAN.