Several of the organizations we work with here at SwiftStack operate multiple, geographically distributed datacenters. In some cases, disaster recovery is the main driver – in other cases, applications in different datacenters all need low latency access to data. We are now working towards evolving Swift where a single cluster can be distributed over multiple, geographically dispersed sites, joined via high-latency network connections.
Swift is currently designed to work in a single region where a region is defined as a low latency link between Swift zones. As long as sites are nearby, zones can be distributed over multiple sites. For example, if you have three nearby sites and you put a zone in each, you can guarantee that you have one replica in each region. For geographically distant datacenters, however, Swift’s ‘container sync’ feature is currently the main option to synchronize data.
As a core contributor to Swift, SwiftStack is keenly interested in pushing forward the development of Swift, specifically to enable Swift to support geographically distributed clusters. We got started on this development in the 1.5 release of Swift and are now continuing this work, which will be made available in upcoming releases of Swift. This overview details a design for this feature. We are also looking forward to discussing this in the upcoming OpenStack Summit in October.
(For those who want to watch rather than read – here is a video overview)
Let’s recap how Swift is designed, which will set the stage for how Swift will be able to support globally distributed clusters.
A Brief Overview of Swift Architecture:
Swift is composed of two tiers:
- Proxy Nodes. Their role is to coordinate requests with the appropriate storage nodes. The proxy nodes can live anywhere, so long that they have a route to the storage nodes. Traffic is load-balanced across all the proxy nodes. Clients make requests to the proxy nodes.
- Storage Nodes. These are the nodes that actually store the data. Clients never talk directly to the storage nodes.
How they work together:
- An incoming write request streams incoming requests to each replica location simultaneously. In the normal case, all 3 replicas are written to at the same time. In a failure case, a quorum will be used (for example, 2 out of 3 writes will need to be successful).
- Failures are normally detected by timeouts – if a storage node does not respond in a reasonable about of time, the proxy considers it to be unavailable and will not attempt to communicate with it for a while.
Because server availability is detected using timeouts, it’s best to have relatively low latency between each node in the cluster. High latency links will cause the proxy node to see more storage node failures than may actually exist. For example, if it takes 350 milliseconds to establish a connection with a storage node and the connection timeout is 500 milliseconds, a small variation in network conditions will cause the proxy to mark that storage node as unavailable.
Swift currently has a feature called container synchronization. With this feature, two independent clusters can be configured to synchronize data between them on a container-by-container basis. To configure this feature, containers in each cluster are identified and data is synchronized between them either 1-way or 2-way using a shared secret synchronization key.
This approach, while robust, results in many replicas because each cluster has its own set of replicas. For example, if you use a replication factor of 3 in each cluster, you would end up with 6 total replicas of each object. Additionally, replication only happens asynchronously. Furthermore, as the number of containers to sync grows, the user burden of managing which containers are synced and their respective synchronization keys increases significantly.
In OpenStack Swift version 1.5.0, Sam Merritt in the SwiftStack team added Tiered Zones into Swift, which is a prerequisite feature for building a globally distributed Swift cluster. Prior to this, Swift had a rigid concept of zoning where replicas were only created in unique zones, and deployers were required to have at least as many zones as replicas in the cluster. The new Tiered Zones feature provided Swift deployers more flexibility in designing clusters and Swift users with more levels of fault-tolerant-domains which it could use to place data. With this significant change, the model in Swift is now ‘as-unique-as-possible’ for placement of replicas.
Practically, this allows small clusters to easily grow up into big clusters as data is placed ‘as-unique-as-possible’ as the cluster expands. In the Folsom release of OpenStack, Swift can uniquely place replicas according to drives, nodes, racks, PDUs, network segments and data center rooms.
A Globally Distributed Cluster:
To create a global cluster, the new concept of Region will be introduced in Swift. A Region is bigger than a Zone and extends the concept of Tiered Zones. Each storage node can be placed in a Region.
A globally replicated cluster is created by deploying storage nodes in each Region. The proxy nodes will have an affinity to a Region and be able to optimistically write to storage nodes based on the storage nodes’ Region. Optionally, the client will have the option to perform a write or read that goes across Regions (ignoring local affinity),if required.
Example #1 – 2 Regions, 3 Replicas:
This configuration would be good for having two replicas in a single Region (the ‘primary’ location) and using another Region to have a single replica (the ‘offsite’ location).
When a client makes a request to the ‘primary’ Region, the proxy node will prioritize the storage nodes in its Region over remote Regions allowing for higher throughput and lower latency on storage requests.
For uploads, as normal, 3 replicas will be written in unique locations, but the proxy node will select 3 locations in the same Region. Then, the replicators will asynchronously replicate one of the 3 copies to the ‘offsite’ Region.
To provide a fully-replicated write at the expense of a higher-latency upload the client could force a concurrent write across the two Regions. Similarly, the client could use the current ‘fetch newest’ feature to force the proxy node to check timestamps of all replicas (including those from the ‘offsite’ Region) to determine the best copy to return.
Example #2 – 3 Regions, 3 Replicas:
This use case would be suitable for deployments where each geographic Region would be accessed equally. This would be ideal for multi-site archiving and filesharing applications.
Each site would accept writes and write 3 copies ‘as-unique-as-possible’ within the Region. Then, asynchronously, the object would be replicated to the other Regions. The client would be provided a more conservative write option to perform a concurrent write across all Regions.
A client would read from a local region and there would be an affinity to serve data from a storage node within the Region. A client would also be provided with the ability to ‘fetch newest’, where the proxy node would query each Region to determine the newest object based on their timestamps.
Example #3 – 2 Regions, 4 Replicas:
This use case would be useful for a dual-coast deployment where each location is accessed equally heavily. For example, a web/mobile application that is delivering application assets and content in multiple locations.
Would be similar to example #1 but allow for increased availability in each Region. In this use case, when the option to concurrently write to multiple Regions is successful, that application asset can be used immediately in the application globally.
Example #4 – Improve all Swift Deployments:
This feature will not only allow a globally distributed cluster, but it will also improve every Swift cluster deployed. Now, proxy nodes distributed throughout the data center can prioritize serving objects that are slightly nearby, for example by favoring storage nodes that are on the same network segment.
Replication traffic needs to be bandwidth-limited across WAN links, both for responsiveness and for cost.
This can be accomplished by adding two new fields to the devices in the ring – replication_ip and replication_port. For example, a device would have the following entries about itself in the ring:
1 2 3 4 5
Additionally, two new methods will be added to the ring.
ring.get_more_replication_nodes() will take advantage of the new replication_ip and replication_port fields. The replicators will be updated to use the new methods. QoS can be set appropriately for managing bandwidth to the offsite replicas.
When a WAN connection fails between two sites, Swift will treat this scenario similar to a ‘down’ node. A replication will not occur unless there is communication with the storage node and there is a failure reading data from a drive. When the WAN connection is reestablished, the same eventually-consistent replica management strategy based on timestamps and checksums will be used to determine the most recent object. When the WAN connection is reestablished, the regions will efficiently synchronize and geographic replicas will again be consistent.
To support adding and removing of Regions, Swift must let the operator adjust the replica count up and down. The ring builder will gain two new commands:
The operator will want to rebalance their rings prior to pushing them out, especially when removing a replica.
Managing a Globally Distributed Cluster with SwiftStack:
A distributed storage system is complex enough! We at SwiftStack take pride in our ability to manage systems storing data across multiple nodes and zones. With a globally distributed cluster, the management complexity will only increase. While Swift continues to evolve to support geographically distributed clusters, do keep in mind the additional tools required to manage and operate the cluster.
For a globally distributed cluster, we will continue to update our SwiftStack Controller product to provide a single interface for managing the entire cluster. We will continue to not only push Swift forward, but also build the tools necessary to manage, monitor, upgrade and scale Swift – not just in a single region, but also across geographically distributed datacenters.