Having been in the storage industry for many years, I am amazed at how much opportunity data offers for shaping business transformation. Want to build and monetize new business models? Glean predictive insights? Enable competitive differentiation? I’d like to share some thoughts on what I see shaping this dynamic space.
Also, if you are going to be at the Global Artificial Intelligence Conference in Santa Clara next week, please come check out my talk on “Edge to Core to Cloud” on January 23, 2019, and say hi! SwiftStack is one of the sponsors of this event.
Parallelism enables AI
Digital transformation and edge computing are leading to a deluge of data, and deriving actionable intelligence from this data is becoming a foremost challenge. Technology advances in the form of GPU computing have enabled massive compute power and parallelism. This has in turn enabled the use of Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL) data pipelines, which can derive actionable intelligence from data. The accuracy and effectiveness of these pipelines can be enhanced or constrained by the underlying data infrastructure. However, infrastructure scale, throughput and concurrency, agility, cost of ownership, and cloud technology maturation are key concerns for businesses interested in using them.
SwiftStack is parallelized storage
SwiftStack is highly parallelized storage, ideal for AI/ML/DL pipelines and workflows. SwiftStack enables these multi-stage workflows, where data is continuously ingested, transformed, harnessed to develop and train new models, and then used to derive inferences. Each of these stages has distinct storage requirements for bandwidth, mixed read/write handling for both small and large files, scale, concurrency, metadata labeling and more.
Stay tuned to this blog (and our website https://www.swiftstack.com/solutions/ai-ml) for more on how SwiftStack provides an ideal solution for distributed, at-scale AI/ML/DL data pipelines.
In the meantime, when evaluating storage infrastructure, consider the following questions:
Does it offer enough ingest throughput and concurrency at petabyte scale?
Advanced processors such as NVIDIA GPUs provide massive parallelization and petaFLOPS of compute, but does the storage I/O match those abilities? If it doesn’t, you most likely will not fully utilize the available compute resources. Underfed GPUs mean lengthier training cycles, which in turn mean longer time-to-value because data scientists wait longer for results.
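To make this question concrete, a back-of-the-envelope estimate can compare the read throughput a training job demands with what the storage can sustain. The sketch below is illustrative only: the function names and every number in it (GPU count, samples per second per GPU, sample size, storage bandwidth) are assumptions, not measurements of any particular system.

```python
def required_read_gbps(num_gpus, samples_per_sec_per_gpu, sample_size_mb):
    """Aggregate read throughput (GB/s) needed to keep every GPU fed."""
    return num_gpus * samples_per_sec_per_gpu * sample_size_mb / 1000.0


def gpu_utilization_ceiling(storage_gbps, required_gbps):
    """Upper bound on GPU utilization when storage is the only bottleneck."""
    if required_gbps <= 0:
        return 1.0
    return min(1.0, storage_gbps / required_gbps)


# Hypothetical job: 16 GPUs, each consuming 2,000 samples/s of 0.5 MB each.
required = required_read_gbps(16, 2000, 0.5)
# Hypothetical storage system that sustains 4 GB/s of reads.
ceiling = gpu_utilization_ceiling(4.0, required)
print(f"required: {required:.1f} GB/s, utilization ceiling: {ceiling:.0%}")
```

With these made-up numbers the job needs 16 GB/s, so storage that sustains only 4 GB/s would cap GPU utilization at 25%, no matter how fast the processors are.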
Does it offer multi-protocol, high throughput ingest options?
Does the storage system provide a choice of ingest protocols like POSIX-compliant native NFS/SMB for enterprise analytical applications and S3/Swift for cloud-native analytical applications?
Is it metadata-enabled?
Since metadata is a rich and often untapped source of business insight, the storage should support tasks like labeling, search, and contextualizing data.
Is all the data accessible and agile?
When data spans multiple petabytes, multiple geographic regions or sites, and multiple clouds, AI/ML/DL workflows need to reach out and touch all of them. A scale-out global namespace instead of distinct silos makes it possible to access massive data sets, and is simpler to manage.
Is it container-ready?
Distributed pipelines leverage containers and orchestration engines like Kubernetes. Does the storage layer integrate with them?
Is it cloud-native to allow edge-to-core-to-cloud workflows?
Traditional controller-based block storage stacks and distributed file storage stacks quickly become choke points in these new workloads. Object storage systems are better equipped for massive parallelism and scale at a lower cost of ownership, while enabling edge-to-core-to-cloud workflows.
Does it provide the best TCO at petabyte scale?
Is the storage software-defined, providing a choice of best-of-breed hardware and multi-cloud accessibility options? Is the storage policy-based, capable of moving data to the optimal tier as the workflow requires?
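The idea of policy-based placement raised in the last question can be pictured with a small sketch. Everything here is hypothetical: the tier names, the age thresholds, and the `DataObject` and `choose_tier` names are invented for illustration and do not describe SwiftStack's policy engine or any product's API.

```python
from dataclasses import dataclass


@dataclass
class DataObject:
    name: str
    days_since_access: int
    in_active_training_set: bool


def choose_tier(obj, warm_after_days=30, cold_after_days=180):
    """Pick a tier from hypothetical age thresholds; purely illustrative."""
    if obj.in_active_training_set:
        return "flash"          # keep data being trained on close to the GPUs
    if obj.days_since_access >= cold_after_days:
        return "cloud-archive"  # rarely touched data goes to the cheapest tier
    if obj.days_since_access >= warm_after_days:
        return "capacity-disk"  # cooling data moves to cheaper bulk storage
    return "flash"              # recently used data stays on the fast tier


objects = [
    DataObject("frames/2019-01.tar", 2, True),
    DataObject("frames/2018-11.tar", 45, False),
    DataObject("frames/2017-06.tar", 400, False),
]
for obj in objects:
    print(obj.name, "->", choose_tier(obj))
```

In this toy policy the workflow stage overrides age: anything in the active training set stays on the fast tier regardless of when it was last read, and everything else drifts toward cheaper tiers as it cools.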
Think data first!
Barriers to an effective AI/ML strategy tend to start with feeding data to the applications, but they certainly extend to other data management and economic issues. These workflows bring storage growing pains, so a system should be architected to scale affordably and to make the best use of on-premises, cloud and multi-cloud resources, both today and in the future as more complex workflows are deployed.
Looking forward to sharing more updates with you all on this topic.