Recently, we posted an article about the criticality of drive failure in a distributed system versus a traditional storage array. Another component you might expect to cause serious challenges when it fails is the system's management controller. In a modern, distributed system, it may not.
A quick recap on the resiliency of a distributed storage system: once the system is set up and running smoothly, when a disk or even an entire node fails, the cluster works around the failure and continues operating with minimal interruption until you have a chance to replace the failed component. This is a key characteristic that makes distributed systems awesome: they are extremely tolerant of sporadic failures like these.
The only component that is often not distributed is the management controller. So what happens if the management capabilities of a distributed system fail or become temporarily unavailable?
Management of a Distributed System
For a modern, distributed storage system, if the management controller happens to fail, not much happens at all! With many of these modern systems, the “control plane” has been deliberately and carefully designed to stay out of what is called the “data plane”, the part of the system that actually handles the data. The control plane handles the configuration and ongoing management of the system components and is not directly involved in the handling (storing, moving, serving, etc.) of data.
Looking one level deeper, what the management controller mostly does is manage the configuration of your distributed storage cluster and the features and functionality the system offers. This includes making sure all of the data is laid out properly, the middleware pipeline is in the right order, options are configured correctly, the software running on each component is up-to-date, and so on. Essentially, the management controller sets up the storage cluster for success, and then lets it run.
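This control-plane/data-plane separation is why data access survives a controller outage. A minimal sketch (with hypothetical class and method names, not SwiftStack's actual API) of the idea: the controller only pushes configuration to nodes, and each node serves reads and writes from its own local state, never calling back into the controller.

```python
# Hypothetical sketch: control plane pushes config; data plane never
# depends on the controller being reachable to serve data.

class ControlPlane:
    """Pushes configuration to nodes; sits outside the data path."""
    def __init__(self):
        self.alive = True

    def push_config(self, node, config):
        if not self.alive:
            raise ConnectionError("management controller is down")
        node.config = dict(config)   # node keeps its own local copy

class StorageNode:
    """Data plane: serves reads and writes using only local state."""
    def __init__(self):
        self.config = {}
        self.objects = {}

    def put(self, key, value):
        self.objects[key] = value    # no controller involvement

    def get(self, key):
        return self.objects[key]     # still works if controller is down

controller = ControlPlane()
node = StorageNode()
controller.push_config(node, {"replicas": 3})

controller.alive = False             # simulate a controller failure
node.put("photo.jpg", b"...")        # data plane is unaffected
assert node.get("photo.jpg") == b"..."
```

Configuration changes, by contrast, fail as soon as the controller is gone, which is exactly the trade-off described in the next section.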
Downsides of a Management Controller Failure
There are, of course, downsides to the management controller going down. You lose access to the wealth of monitoring information it exposes: no more graphs showing how much disk space is free, how system rebalances are going, or how many requests are coming in at different times. You also lose access to alerts about problems in the cluster, making it easier for a minor issue, like a few disks going bad, to grow into a problem large enough to cause trouble even for a robust, widely-distributed storage system.
Also, any ongoing capacity adjustments are put on hold: the management controller orchestrates changes to the topology of the system and, most often, ensures changes are made incrementally to avoid huge storms of data movement that can hurt performance from an application's perspective. For example, in SwiftStack, it is not possible to add capacity to a cluster while the management controller is down, because new system-wide configurations cannot be pushed out, particularly to new nodes. Fortunately, a down management controller does not affect data availability or durability in the near term, and it is not the end of the world!
Product Example: How the SwiftStack Controller Operates
In the SwiftStack Platform, the “control plane” is called the SwiftStack Controller. It runs on a separate server in the system, orchestrates the setup and ongoing maintenance of a SwiftStack storage cluster, manages policies that place data across data centers and clouds (1space), and provides the administrator with reporting data and analytics.
The nodes of a SwiftStack cluster also run an agent that reports back to the SwiftStack Controller. This agent tries to keep everything healthy on the node to the extent it can, which blurs the lines a little bit between the control and data planes, but it only crosses into the data plane to restart critical services that are down. Its primary duties are monitoring the health of the node and enabling the SwiftStack Controller to do its job. For example, when you push a new configuration out to the cluster, the node agent is what actually writes the files and reloads services to make any changes live.
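The "write the files and reload services" step can be sketched as follows. This is a simplified illustration with hypothetical file and function names, not the actual SwiftStack agent: the agent persists each pushed config file to disk, then reloads the service each file belongs to so the change goes live.

```python
# Hypothetical sketch of a node agent applying a pushed configuration.

from pathlib import Path

def apply_config(conf_dir, config_files, reload_service):
    """Write each config file to conf_dir, then reload its service.

    reload_service is injected so the agent can use systemctl, a SIGHUP,
    or whatever mechanism fits the platform.
    """
    for name, contents in config_files.items():
        (Path(conf_dir) / name).write_text(contents)  # make config durable
    # Derive the service name from the file name, e.g. "proxy-server.conf"
    # reloads "proxy-server"; deduplicate so each service reloads once.
    services = sorted({name.rsplit(".", 1)[0] for name in config_files})
    for service in services:
        reload_service(service)
    return services
```

In a real agent the reload would be graceful, so in-flight requests are not dropped while the new configuration takes effect.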
The actual data management happens in the data plane, within the SwiftStack cluster itself. SwiftStack storage is a complex system made up of a number of different services running across your nodes that coordinate to store your data durably and serve it to applications reliably and promptly. Once the cluster is up and running, it mostly stays up and running. When individual parts fail, like a drive or a server, the rest of the system keeps humming along, and access to your data remains.
Management Controller Recommendations for High Availability
Even though a management controller failure does not cause an immediate, critical event, it is still best practice to design the environment so the controller is up and running most of the time. Keeping the management controller highly available is handled differently by different distributed storage systems. Some management systems require highly available virtual infrastructure to run, some are hosted as-a-service in “the cloud”, where the vendor is responsible for high availability, and some are architected to fail over to another running instance.
SwiftStack offers two primary options for running the SwiftStack Controller while keeping it highly available: as-a-service, or behind the firewall on two independent servers. For the on-premises option, the recommended setup includes configuring a warm standby. This secondary SwiftStack Controller holds all the relevant data from the primary SwiftStack Controller, synced regularly, so if something happens to the primary Controller, the standby can take over all duties, restoring full operational control over your storage cluster with little management downtime.
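The warm-standby pattern described above can be sketched conceptually (this is an illustration with made-up names, not SwiftStack's implementation): the primary's state is copied to the standby on a schedule, and on failover the standby is promoted and serves with the data as of the last sync.

```python
# Hypothetical sketch of a warm-standby management controller.

class Controller:
    def __init__(self, role):
        self.role = role
        self.state = {}

    def sync_to(self, standby):
        # In this sketch, a periodic full copy; a real system would
        # replicate incrementally.
        standby.state = dict(self.state)

    def promote(self):
        self.role = "primary"

primary = Controller("primary")
standby = Controller("standby")

primary.state["cluster_config"] = {"nodes": 12}
primary.sync_to(standby)

# Primary fails; the standby takes over with the last-synced state.
standby.promote()
assert standby.role == "primary"
assert standby.state["cluster_config"] == {"nodes": 12}
```

The "warm" in warm standby is visible here: any change made on the primary after the last sync would not be on the standby, which is why regular syncing matters.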
If you would like to continue the conversation around management controller configuration or distributed storage design, we have an excellent team of solutions engineers who are here to assist.