Splunk Enterprise protects mission-critical applications, and in doing so, it has become a mission-critical application that must be protected and factored into every disaster recovery (DR) plan. Since SwiftStack is a storage provider, this article will focus on protecting the data sets within Splunk. With Splunk’s SmartStore feature – first introduced in Splunk Enterprise 7.2 and enhanced in Splunk Enterprise 7.3 – it has become much easier to sustain continuous data availability for Splunk indexes even as serious infrastructure failures occur.

Let’s tackle the basics first. Splunk SmartStore enables a two-tier architecture for running Splunk Enterprise. There’s the indexer tier, which is sized to the data ingest rate, and the scale-out storage tier, which is sized to the data retention requirement. Compute and storage are allocated to these functions accordingly. This right-sizing of resources will reduce the cost and complexity of the underlying infrastructure, which is the purpose of SmartStore.
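
To make that sizing concrete with a hypothetical example: a deployment ingesting 500 GB/day with a one-year retention requirement would size its indexer tier for the daily ingest plus a cache of recent Warm buckets, while the storage tier holds roughly 365 days’ worth of buckets. Since Splunk’s on-disk index size is often around half the raw data volume (it varies by data type), that works out to something on the order of 90 TB of usable object storage, independent of how many indexers are deployed.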

Splunk Warm Buckets Take Center Stage with SmartStore

With SmartStore, Splunk Enterprise continues to use Hot and Warm buckets to hold data, but this model is constructed in a more efficient way. Hot buckets store active data on Flash media in the indexers. As Hot buckets roll to Warm, they are retained in a cache on that same Flash storage. The new, and most significant, piece of the SmartStore architecture is that Warm buckets are also immediately written to S3-compatible scale-out storage, which can be deployed in a single site or across multiple sites (see Figure 1).
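
As a concrete illustration, here is a minimal indexes.conf sketch of that arrangement. Treat it as a hedged example rather than a production recipe: the bucket name, endpoint, and credential values are placeholders, though the setting names are standard SmartStore configuration.

    [volume:remote_store]
    storageType = remote
    # S3-compatible object store holding the master copy of Warm buckets.
    path = s3://splunk-smartstore
    remote.s3.endpoint = https://swiftstack.example.com
    remote.s3.access_key = <access key>
    remote.s3.secret_key = <secret key>

    [main]
    # Upload Warm buckets for this index to the remote volume.
    remotePath = volume:remote_store/$_index_name
    # Splunk replication protects Hot buckets; the object store
    # handles Warm-bucket durability.
    repFactor = auto

The repFactor = auto setting is what hands Warm-bucket durability to the object store while Splunk’s own replication continues to protect Hot buckets, a division of labor we return to below.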

The Warm buckets remain in the indexer cache for a short period of time (days to perhaps a few weeks) and then age out. At this point, the master copy of the Splunk data resides in Warm buckets on scale-out storage, such as SwiftStack. It can stay there for months or years, depending on the retention period setting, and is fully searchable.
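
How long Warm buckets stay in that cache is tunable through Splunk’s cache manager. A minimal sketch of the relevant server.conf stanza, with illustrative values rather than recommendations:

    [cachemanager]
    # Cap the local Warm-bucket cache at roughly 500 GB (value is in MB).
    max_cache_size = 500000
    # Evict the least recently used buckets first (the default policy).
    eviction_policy = lru
    # Protect buckets with data newer than one day from eviction.
    hotlist_recency_secs = 86400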

Figure 1: Splunk SmartStore creates a set of Warm buckets in S3-compatible scale-out storage, like SwiftStack, which can be deployed in a single site or across multiple sites.

This background brings us to an essential point about data protection for SmartStore indexes. Splunk’s replication feature continues to protect the Hot buckets, but the scale-out storage is responsible for ensuring data durability for the master set of Warm buckets. The storage system’s high availability (HA) features apply to both single-site and multi-site environments, providing resiliency against incidents ranging from a single server failure to the loss of an entire site.

With SmartStore, the indexers are now decoupled from the data they search in those Warm buckets, creating a stateless model. We’ll revisit that point shortly. For now, let’s examine how a SmartStore-enabled architecture provides continuous data availability for Splunk Enterprise when an unexpected event happens.

How a Site Failure is Handled by SmartStore and SwiftStack

Figure 2: In a multi-site configuration of SwiftStack storage, Splunk indexes remain online even after a complete site outage occurs.

It’s a worst-case scenario for IT. The primary data center has experienced a major outage. Even though the DR plan has been tested and audited, there’s major anxiety among the team.

Let’s set the context more specifically for Splunk Enterprise and its data. The Splunk indexer cluster and the SwiftStack storage nodes located in site #1 are unreachable. However, an additional set of SwiftStack storage nodes located in site #2 is online and ready for I/O requests (see Figure 2). SwiftStack’s built-in data replication ensures that all Splunk indexes written to site #1 are also available in site #2. Further, since a multi-site SwiftStack system functions as a unified pool of storage, it remains addressable through a single endpoint.

Figure 3: New indexers can be provisioned from the metadata in the Splunk Warm buckets, which has been protected across sites by SwiftStack replication.

This is where the stateless nature of SmartStore indexes comes into play. Even with the primary site offline, a new indexer or cluster of indexers can be started and pointed to the scale-out storage in site #2, which remains fully operational (see Figure 3). From there, the indexer “bootstrapping” process fetches metadata from the Warm buckets and updates the .bucketManifest file, Splunk’s master catalog of all buckets in an index. Because that metadata is small, bootstrapping completes quickly, and the indexer is back in service.
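
A hedged sketch of what that looks like on the replacement indexer, assuming the same remote volume definition is reused (the endpoint and bucket name shown are placeholders):

    # indexes.conf on the replacement indexer: same volume and index
    # stanzas as the failed cluster, so the remote store remains the
    # source of truth.
    [volume:remote_store]
    storageType = remote
    path = s3://splunk-smartstore
    # If the multi-site cluster presents a single endpoint, this line is
    # unchanged; otherwise, point it at the surviving site.
    remote.s3.endpoint = https://site2.swiftstack.example.com

    # After a restart ($SPLUNK_HOME/bin/splunk restart), the indexer
    # discovers the buckets in the remote store and rebuilds its
    # .bucketManifest file.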

Note: bootstrapping a new indexer is useful beyond DR; examples include troubleshooting or upgrading a server in an otherwise healthy indexer cluster.

Splunk Searches Continue when Data is Protected by SwiftStack

At this point, searches can commence (see Figure 4), with the actual data payload being downloaded from the Warm buckets on SwiftStack into the Splunk indexers as requests are made. Continuous data availability has been achieved. Of course, the “new” indexers can also be used to ingest data, with the workflow resembling a single-site version of Figure 1.

Figure 4: Newly provisioned indexers can search all Splunk data in the Warm buckets, even while one site remains offline.

A Seamless Return to a Normal State

We’ll complete the site failure scenario with a brief description of what happens with SmartStore indexes and SwiftStack storage when site #1 is restored. It’s pretty straightforward. Splunk Enterprise operations will resume and SmartStore will write data to SwiftStack as buckets roll from Hot to Warm. Searches will be performed as usual. In the background, SwiftStack’s replication engine will synchronize any data that was written to the nodes in site #2 while the nodes in site #1 were out of service (see Figure 5). The complete Splunk Enterprise environment has been reconstituted.

Figure 5: Data will be synchronized between the storage nodes across the sites, all while searches are conducted against the full data set.

One Final Point about the Public Cloud

Some people reading this blog are likely saying, “that process sounds slick, but we don’t have a second data center or the equipment to carry out a recovery in this way.” The good news is that Public Cloud IaaS (e.g., Amazon Web Services (AWS), Microsoft Azure, or Google Cloud) can be your secondary data center. Here’s how. SwiftStack “1space” can replicate data to Public Cloud storage (e.g., an AWS S3 bucket) and Splunk indexers can run on Public Cloud compute (e.g., EC2 instances). So, the configuration we’ve specified and the steps we’ve documented still hold true, only with Public Cloud storage and compute resources taking the place of “site #2”. Again, continuous data availability is achieved.
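
On the Splunk side, that swap amounts to changing the remote volume definition on the cloud-hosted indexers. A hedged sketch, with a placeholder bucket name standing in for the 1space replication target:

    [volume:remote_store]
    storageType = remote
    # AWS S3 bucket kept in sync by 1space (name is hypothetical).
    path = s3://dr-splunk-smartstore
    remote.s3.endpoint = https://s3.us-east-1.amazonaws.com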

For questions about how the concepts in this blog apply to your Splunk Enterprise environment, feel free to email us at splunk@swiftstack.com. Or, visit our booth at .conf19.

Technical note: the process described in this blog was verified in SwiftStack’s lab by the Solutions Engineering team, notably, Anup Pal.

Meet with a SwiftStack Expert at Splunk .conf19 in Las Vegas!

Co-Author: Eric Rife is the Director of Solutions Architects at SwiftStack.

erife@swiftstack.com

About Author

Greg Govatos

Greg Govatos is the VP of Strategic Partnerships for SwiftStack.