I recently had an opportunity to speak at HIMSS17—the conference “where the brightest minds in health and IT meet.” According to the session abstract, my job was to share case studies for integrating object storage and leveraging metadata to advance the state of the art in genomics research. I thought it might be useful to capture some of what I shared at that conference here. If you don’t have time to read through the post, start with this four-minute video, which we also produced for the conference to summarize some of the key points I detail below.
Learning From Our Clients
As an area of intentional focus for us, SwiftStack has nearly a dozen clients in Life Sciences today and more than that in various stages of evaluation. We don’t pretend to be experts in genomics or research science, but we have collected a number of well-known and well-respected clients who have—quite frankly—taught us a lot.
We knew when we first entered the Life Sciences world that our cloud architecture held promise for scalability needs. Over the past three years, we have worked with our clients to understand their existing pain points and have continually enhanced SwiftStack’s software and business to address that pain more specifically and completely. Institutions like Fred Hutch, HudsonAlpha, OMRF, and Counsyl have spoken publicly about how SwiftStack has improved their workflows and their businesses by enabling increased research and/or revenue. We’ve found that every company or institution’s situation is a little different, but a handful of common challenges exist across the industry; the top three seem to be scalability (in terms of capacity), performance (in terms of throughput), and collaboration (in terms of data sharing and transfer).
To detail SwiftStack’s value in addressing each of these industry challenges would probably require another blog post, so I’ll summarize just the top three:
- Massive Throughput: At Fred Hutch, a SwiftStack cluster installed on some repurposed hardware they already owned outperformed their existing Isilon for throughput by 3x. (Funny enough, they initially called Isilon their “fast” tier of storage and SwiftStack their “archive” tier, but people are now questioning those names!)
- Multi-PB Scalability: A number of our Life Sciences clients are using SwiftStack software to manage several petabytes of data.
- Collaborator Access: Both HudsonAlpha and the Oklahoma Medical Research Foundation (OMRF) use container ACLs and TempURLs to give collaborators or clients access to objects directly from their SwiftStack clusters. Fred Hutch synchronizes some of their data to Amazon’s S3 buckets for easy external collaborator access.
So, how does SwiftStack fit into the workflows of our Life Sciences clients? We can usually break it down into one or more of the following four areas: Genome Sequencing, Scientific Computing, Collaboration & Distribution, and Backup & Recovery. Let’s look at each of them in a bit more detail.
Genome Sequencing
Today’s genome sequencers—especially something like a HiSeq sequencer from Illumina—can produce over a petabyte of data each year. And while it’s amazing that the time and cost of sequencing a genome have dropped dramatically over the past few years, that drop has created a storage challenge.
When we first encounter most of our Life Sciences clients, the majority have a sequencing workflow that looks something like the top third of the picture above: The base call (BCL) files produced by the sequencers are buffered temporarily on the sequencer workstation and then moved off to some kind of NAS system (often Isilon) using Illumina’s Run-Copy-Service or similar. From there, they get pulled into an HPC cluster for consolidation with Illumina’s bcl2fastq tool, and the resulting FASTQ file is pushed back to the same or another NAS system. If the workflow doesn’t end there, the FASTQ file gets aligned against a reference genome in the HPC cluster using something like the Burrows-Wheeler Aligner (BWA), and variant analysis may happen in the HPC farm as well—producing variant call files (VCFs) or other research data.
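For the hands-on readers, the HPC stages of that workflow boil down to a couple of command-line tools. The sketch below only builds the argument lists and hands them to a runner; it assumes bcl2fastq and bwa are on the PATH, and all paths shown are hypothetical placeholders, not anyone’s real pipeline.

```python
import subprocess

def bcl2fastq_cmd(run_dir: str, fastq_dir: str) -> list[str]:
    # Consolidate per-cycle BCL files into FASTQ with Illumina's bcl2fastq.
    return ["bcl2fastq", "--runfolder-dir", run_dir, "--output-dir", fastq_dir]

def bwa_mem_cmd(reference: str, fastq: str) -> list[str]:
    # Align reads against a reference genome with BWA-MEM (SAM on stdout).
    return ["bwa", "mem", reference, fastq]

def run_stage(cmd: list[str]) -> None:
    # In a real environment this would be submitted to the HPC scheduler.
    subprocess.run(cmd, check=True)

# Example invocations (paths are hypothetical):
# run_stage(bcl2fastq_cmd("/runs/170301_HS001", "/scratch/fastq"))
# run_stage(bwa_mem_cmd("/ref/GRCh38.fa", "/scratch/fastq/sample1_R1.fastq.gz"))
```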
What we’ve heard again and again is that the throughput in and out of the NAS systems in that picture slows down the overall process and that they are “expensive, expensive, expensive.” So, SwiftStack can replace that expensive and underperforming Isilon.
For the admin-minded folks, recall that SwiftStack’s primary access protocols are the S3 and Swift APIs, so it’s not a drop-in replacement for your POSIX-compliant NAS systems—at least not yet. (Much like the adoption of file protocols like NFS and SMB/CIFS, we expect that most storage-heavy applications will be modernized in time to “speak” object APIs like S3 or Swift.)
For now, we have tools that can automate movement of data, and many clients build that automation into their LIMS (Laboratory Information Management System) as well.
In the orange text in the diagram, I mention a few of our tools: Specifically, SwiftStack’s “Watched Folder” tool can be set to automatically copy or move a file written on a NAS share into SwiftStack, and the “SwiftStack Client” is an easy-to-use GUI tool that allows for manual upload, download, and searching within a SwiftStack cluster.
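Under the hood, what a watched-folder-style mover does for each new file amounts to a single authenticated HTTP PUT against the Swift API. Here is a minimal, standard-library-only sketch of that request; the storage URL, token, and container names are hypothetical placeholders, and this is an illustration of the protocol rather than the Watched Folder tool itself.

```python
import urllib.request

def build_upload_request(storage_url: str, token: str, container: str,
                         name: str, data: bytes) -> urllib.request.Request:
    # A Swift object upload is simply: PUT <storage-url>/<container>/<object>
    # with the auth token in the X-Auth-Token header.
    return urllib.request.Request(
        url=f"{storage_url}/{container}/{name}",
        data=data,
        method="PUT",
        headers={"X-Auth-Token": token},
    )

# A folder watcher would call urllib.request.urlopen(req) for each new file
# it sees, then optionally remove the local copy (values are hypothetical):
req = build_upload_request("https://swift.example.org/v1/AUTH_lab",
                           "AUTH_tk_hypothetical", "fastq",
                           "sample1_R1.fastq.gz", b"...")
```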
So, if we think of the bottom third as the ideal situation—where a SwiftStack private cloud has eliminated the need for these various NAS systems—jumping from the top to the bottom in one move may still be a stretch for most people. For one thing, the reality for many folks is that it doesn’t make sense to completely abandon your existing NAS investment right away. So, instead, many of our clients have taken a logical step in between, and that’s what we have drawn in the middle.
In essence, you can keep your existing NAS system in place, doing a subset of its current job. By archiving off the “finished” data, you can avoid buying additional NAS capacity, and you can probably even shrink the existing capacity as it ages or gets repurposed.
Then, in time, when you have modernized your workflow to interact with SwiftStack directly, you can remove the NAS system from the picture entirely.
Scientific Computing
Traditional storage architectures have been performance-optimized over time, but the focus has been on transactional applications like databases or virtual machines. There is definitely a need for that type of storage in certain places—like the scratch space in your HPC farm—but it is equally, and perhaps increasingly, important to be able to move large amounts of data quickly between your active archive and your compute environment.
SwiftStack’s architecture scales horizontally—enabling throughput performance to grow linearly with the growth of the cluster. And again, for the admin-minded folks wondering what specifically is required to implement this, think of it this way: Instead of your HPC cluster mounting a share directly on the NAS system to copy the remote data to /tmp or wherever, you just use one of a number of data mover tools to “stage” the data before processing and to “archive” the results when complete. Or, if you typically process directly against data on your NAS system, you can stage data from SwiftStack on your NAS and treat it just like a scratch space for your HPC cluster.
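That “stage, compute, archive” pattern is simple enough to sketch end to end. In the runnable sketch below, the object store is faked with a local directory so nothing external is required; in practice the two copy steps would be object GETs and PUTs via a data-mover tool or the S3/Swift API, and the job command is a hypothetical stand-in for an aligner or similar.

```python
import pathlib
import shutil
import subprocess
import sys
import tempfile

def stage_compute_archive(store: pathlib.Path, obj: str, cmd: list[str]) -> None:
    with tempfile.TemporaryDirectory() as scratch:
        local = pathlib.Path(scratch) / obj
        # 1. Stage: copy the input from the (faked) object store to fast scratch.
        shutil.copy(store / obj, local)
        # 2. Compute: run the job against the staged copy; the job is assumed
        #    to read (and possibly modify) the staged file in place.
        subprocess.run(cmd + [str(local)], check=True)
        # 3. Archive: push the result back to the store when complete.
        shutil.copy(local, store / f"{obj}.done")

# Demo with a no-op "job" standing in for real processing:
store = pathlib.Path(tempfile.mkdtemp())
(store / "sample1.fastq").write_text("@read1\nACGT\n+\nFFFF\n")
stage_compute_archive(store, "sample1.fastq", [sys.executable, "-c", "pass"])
```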
Collaboration and Distribution
An additional area where we have seen clients experience a lot of benefit is what I might call “collaboration and distribution.” For some, it might be one or the other or both. In the top half of this picture, I’ve depicted some—but certainly not all—of the ways I know folks in this industry move data back and forth internally and externally. And again, what we generally hear is that this is a pain, and it’s slow, and it can be expensive, and it may put important data in an insecure position.
One of the benefits of SwiftStack’s cloud architecture is that it can scale from one to many geographic regions—all in the same namespace—and, according to the policies you choose, SwiftStack will automatically replicate data between regions. So, for example, HudsonAlpha uses policies to ensure that BAM files uploaded in their Huntsville, Alabama facility are replicated to data centers that have higher download bandwidth and are physically closer to their clients and collaborators.
Then, with SwiftStack’s ability to control access rules for accounts and containers and to even provide temporary URLs for download access to individual objects, remote clients or researchers can pull data directly from the SwiftStack cluster. Some of our clients, like Oklahoma Medical Research Facility (OMRF), have even built nice web portals and send automatic emails to clients with a download link when their sequencing results are ready for access. Alternatively, if S3 or Google is a better place to “meet in the cloud,” then SwiftStack’s Cloud Sync feature can automatically replicate data there.
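Those temporary URLs are worth a quick aside for the technically curious: a Swift TempURL is just an HMAC-SHA1 signature over the request method, expiry time, and object path, so a portal or notification script can mint them easily. Here is a standard-library-only sketch (python-swiftclient ships an equivalent helper); the host, path, and key are hypothetical.

```python
import hmac
import time
from hashlib import sha1

def make_temp_url(host: str, path: str, key: str, ttl: int,
                  method: str = "GET") -> str:
    # Swift validates an HMAC-SHA1 over "<METHOD>\n<expires>\n<path>",
    # computed with the account's Temp-URL key.
    expires = int(time.time()) + ttl
    body = f"{method}\n{expires}\n{path}"
    sig = hmac.new(key.encode(), body.encode(), sha1).hexdigest()
    return f"{host}{path}?temp_url_sig={sig}&temp_url_expires={expires}"

# e.g. a download link that stops working after 24 hours (values hypothetical):
url = make_temp_url("https://swift.example.org",
                    "/v1/AUTH_lab/results/sample1.fastq.gz",
                    "secret-temp-url-key", ttl=86400)
```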
Backup & Recovery
The last use case I’ll mention is traditional IT backup, because it’s probably one of the simplest and most obvious.
SwiftStack enables a new way to protect data: you can drastically simplify the way you store backup data onsite and offsite without redesigning the way your backup application reads and writes data from its sources. In fact, the world’s leading backup software vendors—including Veritas and Commvault—have already implemented direct-to-cloud support in their backup applications.
Instead of managing many storage systems onsite and using tape to move data offsite to the mountain, you can use cloud storage with a single namespace across multiple sites: this simplifies the backend storage infrastructure, removes a layer of storage, and leverages other data centers you’re already managing instead of paying for an offsite storage service.
Increasingly Important: Searchability
Finally, I wanted to mention one other thing we’re seeing begin to demonstrate real value in the Life Sciences world, and that is searchability within a scientific data archive. SwiftStack allows for the addition of custom metadata on each object that is stored, and when you’re getting into the world of millions and billions of objects in an archive, searching by relevant metadata is infinitely simpler than trying to organize some kind of massive folder structure or naming convention.
From a business perspective, this metadata search functionality has resulted in both cost avoidance and monetization. At OMRF, for example, metadata search saves the cost of re-sequencing genomes by letting a researcher find out whether a particular sample has already been sequenced; if so, he or she can request access to the existing data instead of spending thousands of dollars and days to regenerate it. For other clients, metadata search creates a new revenue stream by enabling validation of specific results across the larger population of data in their archive.
Many institutions and businesses are still figuring out their strategies in this space, but it doesn’t take much effort to build up a collection of metadata in your LIMS as you sequence and align a genome, for example: You could easily capture the scientist, sequencer, reagent, sample, date, time, reference genome, etc., and store that with a BAM file in your archive.
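Mechanically, that metadata rides along with the object as custom X-Object-Meta-* headers sent on the upload, and searching on those fields then becomes a query rather than a crawl through a folder hierarchy. A small standard-library sketch—the field names and values below are purely illustrative, and the toy `find` function stands in for a real metadata search index:

```python
def metadata_headers(fields: dict[str, str]) -> dict[str, str]:
    # Swift stores arbitrary per-object metadata in X-Object-Meta-* headers,
    # supplied with the object's PUT (or a later POST).
    return {f"X-Object-Meta-{k}": v for k, v in fields.items()}

headers = metadata_headers({
    "Scientist": "jdoe",      # illustrative values a LIMS might record
    "Sequencer": "hiseq-02",
    "Sample": "S-12345",
    "Reference": "GRCh38",
})

def find(archive: dict[str, dict[str, str]], **query: str) -> list[str]:
    # With metadata captured, "has sample S-12345 been sequenced?" is a
    # lookup against the index, not a walk through folder names.
    return [name for name, meta in archive.items()
            if all(meta.get(f"X-Object-Meta-{k}") == v for k, v in query.items())]
```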
Example: Oklahoma Medical Research Foundation
And to put it all together in one example, this is essentially what the Oklahoma Medical Research Foundation is doing with SwiftStack today. Let me briefly walk through how all the pieces we just discussed fit together:
First, they initially implemented SwiftStack for a sequencing-as-a-service business model: they are given a sample to sequence, and they produce unaligned FASTQ files. Illumina sequencers generate BCL files on their workstations, and the Illumina Run-Copy-Service moves those to a CIFS share on Isilon. From there, the HPC cluster pulls the BCLs to run bcl2fastq and produce the FASTQ file, which is pushed into their SwiftStack archive. All along the way, their automation system builds a collection of relevant metadata and writes it, along with the FASTQ file, into SwiftStack.
If this is for an internal researcher, they use simple data-mover scripts (they chose to write their own, but we have comparable tools that we can provide) to pull the FASTQ back to the HPC cluster for alignment and subsequent research. If this is for an external customer, then the FASTQ file is put into a private container in SwiftStack, and an email is automatically sent with a time-limited download URL.
OMRF also uses SwiftStack as the target for Commvault for backup of many of their servers and other systems in their data centers.
And, while it’s not in production yet, they are actively planning to redirect new small-animal MRI and high-end microscopy data into SwiftStack with searchable metadata just like their genome data.
Putting It All Together
So, if you take what OMRF has implemented and move the strategy forward a couple of logical steps, you get the picture above: SwiftStack can become your private cloud “Scientific Data Archive”—the center of your genomics sequencing pipeline, the repository for your research data, the access point for your collaborators, and a bridge even to the public cloud. For many of our clients, this has solved real limitations in their infrastructure and workflow; if you think our experience might be helpful to you, please let us know. We would be glad to learn from you and share what we have learned as well!