Splunk Enterprise protects mission-critical applications, and in doing so, it has become a mission-critical application that must be protected and factored into every disaster recovery (DR) plan. Since SwiftStack is a storage provider, this article will focus on protecting the data sets within Splunk. With Splunk’s SmartStore feature – first introduced in Splunk Enterprise 7.2 and enhanced in Splunk Enterprise 7.3 – it has become much easier to sustain continuous data availability for Splunk indexes even as serious infrastructure failures occur.
Let’s tackle the basics first. Splunk SmartStore enables a two-tier architecture for running Splunk Enterprise. There’s the indexer tier, which is sized to the data ingest rate, and the scale-out storage tier, which is sized to the data retention requirement. Compute and storage are allocated to these functions accordingly. This right-sizing of resources will reduce the cost and complexity of the underlying infrastructure, which is the purpose of SmartStore.
Splunk Warm Buckets Take Center Stage with SmartStore
With SmartStore, Splunk Enterprise continues to use Hot and Warm buckets to hold data, but this model is constructed in a more efficient way. Hot buckets store active data on Flash media in the indexers. As Hot buckets roll to Warm, these buckets are stored in a cache on that same Flash storage. The new, and most significant, piece of the SmartStore architecture is that Warm buckets are also immediately written to S3-compatible scale-out storage, which can be deployed in a single site or across multiple sites (see Figure 1).
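To make the model concrete, the warm-bucket destination is declared in indexes.conf as a remote storage volume. The sketch below is illustrative only – the volume name, bucket path, endpoint URL, and index name are assumptions, and credentials would normally come from a secured store rather than appearing inline:

```ini
# indexes.conf — hypothetical SmartStore remote volume (all names illustrative)
[volume:remote_store]
storageType = remote
path = s3://splunk-smartstore
remote.s3.endpoint = https://swiftstack.example.com
remote.s3.access_key = <access-key>
remote.s3.secret_key = <secret-key>

# An index whose Warm buckets are written to that volume
[my_index]
homePath   = $SPLUNK_DB/my_index/db
coldPath   = $SPLUNK_DB/my_index/colddb
thawedPath = $SPLUNK_DB/my_index/thaweddb
remotePath = volume:remote_store/$_index_name
```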
The Warm buckets remain in indexer cache for a short period of time – days to maybe a few weeks – and then age out. At this point, the master copy of the Splunk data resides in Warm buckets on scale-out storage, such as SwiftStack. It can stay there for months or years, depending on the retention period setting, and is fully searchable.
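Both the cache residency and the retention period are tunable. As a rough sketch – the values below are assumptions for illustration, not recommendations – local cache behavior lives in server.conf while retention on the remote store is set in indexes.conf:

```ini
# server.conf on each indexer — local Flash cache for Warm buckets
[cachemanager]
max_cache_size = 500000           # MB of local cache (illustrative value)
hotlist_recency_secs = 86400      # favor keeping buckets newer than ~1 day

# indexes.conf — how long data remains searchable on the remote store
[my_index]
frozenTimePeriodInSecs = 31536000 # ~1 year retention (illustrative value)
```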
This background brings us to an essential point about data protection for SmartStore indexes. Splunk’s replication feature continues to protect the Hot buckets, but the scale-out storage is responsible for ensuring data durability for the master set of Warm buckets. The storage system’s high availability (HA) features are applied to both single-site and multi-site environments, providing resiliency against incidents ranging from a single server outage to a full site failure.
With SmartStore, the indexers are now decoupled from the data they search in those Warm buckets, creating a stateless model. We’ll revisit that point shortly. For now, let’s examine how a SmartStore-enabled architecture provides continuous data availability for Splunk Enterprise when an unexpected event happens.
How a Site Failure is Handled by SmartStore and SwiftStack
It’s a worst-case scenario for IT. The primary data center has experienced a major outage. Even though the DR plan has been tested and audited, there’s major anxiety among the team.
Let’s set the context more specifically for Splunk Enterprise and its data. The Splunk indexer cluster and the SwiftStack storage nodes that are located in site #1 are unreachable. However, an additional set of SwiftStack storage nodes located in site #2 is online and ready for I/O requests (see Figure 2). SwiftStack’s built-in data replication scheme ensures all Splunk indexes that have been written to site #1 are available in site #2. Further, since a multi-site SwiftStack system functions as a unified pool of storage, it’s addressable through a single path name.
This is where the stateless nature of SmartStore indexes comes into play. Even with the primary site offline, a new indexer or cluster of indexers can be started and pointed to the scale-out storage in site #2, which remains fully operational (see Figure 3). From there, the indexer “bootstrapping” process fetches metadata from the Warm buckets and updates the .bucketManifest file, Splunk’s master catalog of all buckets in an index. Since the metadata is very small, bootstrapping completes very quickly, and the indexer is back in service.
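In practice, the replacement indexers simply reuse the same SmartStore configuration, with the S3 endpoint resolving to the surviving nodes. A hypothetical fragment (the endpoint URL, volume name, and bucket path are assumptions):

```ini
# indexes.conf on a replacement indexer — same namespace, surviving endpoint
[volume:remote_store]
storageType = remote
path = s3://splunk-smartstore
# Only the endpoint needs to reach site #2's nodes; with a multi-site
# SwiftStack cluster this can even be the same name, re-resolved via DNS
# or a load balancer (illustrative URL).
remote.s3.endpoint = https://site2.swiftstack.example.com
```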
Note: bootstrapping a new indexer can be beneficial for more purposes than just DR. Examples include troubleshooting or upgrading a server in an otherwise healthy indexer cluster.
Splunk Searches Continue when Data is Protected by SwiftStack
At this point, searches can commence (see Figure 4), with the actual data payload being fetched from the Warm buckets on SwiftStack into the Splunk indexers as search requests are made. Continuous data availability has been achieved. Of course, the “new” indexers can also be used to ingest data, with the workflow resembling a single-site version of Figure 1.
A Seamless Return to a Normal State
We’ll complete the site failure scenario with a brief description of what happens with SmartStore indexes and SwiftStack storage when site #1 is restored. It’s pretty straightforward. Splunk Enterprise operations will resume and SmartStore will write data to SwiftStack as buckets roll from Hot to Warm. Searches will be performed as usual. In the background, SwiftStack’s replication engine will synchronize any data that was written to the nodes in site #2 while the nodes in site #1 were out of service (see Figure 5). The complete Splunk Enterprise environment has been reconstituted.
One Final Point about the Public Cloud
Some people reading this blog are likely saying, “that process sounds slick, but we don’t have a second data center or the equipment to carry out a recovery in this way.” The good news is that Public Cloud IaaS – e.g., Amazon Web Services (AWS), Microsoft Azure, or Google Cloud – can be your secondary data center. Here’s how. SwiftStack “1space” can replicate data to Public Cloud storage (e.g., an AWS S3 bucket), and Splunk indexers can run on Public Cloud compute (e.g., EC2 instances). So, the configuration we’ve specified and the steps we’ve documented still hold true, only with Public Cloud storage and compute resources taking the place of “site #2”. Again, continuous data availability is achieved.
Technical note: the process described in this blog was verified in SwiftStack’s lab by the Solutions Engineering team, notably, Anup Pal.
Co-Author: Eric Rife is the Director of Solutions Architects at SwiftStack.