SwiftStack Announces Data Analytics Solution with 10x Performance and Massive Scale

Digital transformation and IoT-related use cases are forcing analytics to move from a batch, descriptive mode to an actionable, predictive and prescriptive mode. In addition, accelerated compute, the availability of large volumes of data and the open sourcing of deep learning frameworks are leading to a convergence of data analytics and cognitive analytics (AI / deep learning) use cases. Data analytics is increasingly used for ETL and data transformation within AI and deep learning pipelines. Enterprise customers are asking for multi-cloud or hybrid data lakes that span on-premises infrastructure and multiple public clouds, to get best-of-breed infrastructure and prevent vendor lock-in, while data, as a strategic asset, is stored on-premises.

Scale and performance needs are forcing users to look beyond HDFS

MapReduce and HDFS have served batch, descriptive use cases well. As the market transitions, however, in-memory frameworks such as Spark and Presto are becoming the norm. On the storage side, there is a need for more performance, scale and durability, multi-cloud workflows and better economics to manage the data deluge.

SwiftStack for large Perforce CI/CD pipeline

As a proof point, one of our customers running large Perforce CI/CD pipelines was using HDFS to serve 5000 builders and 4000 testers. They had issues with FUSE mounts to HDFS and decided to move to SwiftStack, with clients using S3 APIs. SwiftStack has already been involved with these use cases and has quickly adapted to the market transitions.

S3A advantage

Customers have been running Spark deployments on SwiftStack using the S3A connector for some time now. S3A is an open source connector for Hadoop and Spark that lets users read and write data in SwiftStack using S3 APIs. S3A is faster for large files, supports parallel uploads and copy and rename operations, and supports partial reads, so a job can read part of a file without downloading all of it.
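As a rough illustration of what pointing S3A at an S3-compatible store involves, the sketch below builds the Spark properties that would direct the Hadoop S3A connector to a custom endpoint. The `fs.s3a.*` keys are standard Hadoop S3A properties; the `S3ASettings` object, its method name and the endpoint URL in the usage note are hypothetical and only for illustration, and a real SwiftStack deployment may need additional settings.

```scala
// Sketch only: builds Spark properties that point the Hadoop S3A connector
// at an S3-compatible endpoint. The object and method names are hypothetical.
object S3ASettings {
  def forEndpoint(endpoint: String, accessKey: String, secretKey: String): Map[String, String] =
    Map(
      "fs.s3a.endpoint" -> endpoint,
      "fs.s3a.access.key" -> accessKey,
      "fs.s3a.secret.key" -> secretKey,
      // Path-style requests (endpoint/bucket/key) are commonly needed for non-AWS endpoints.
      "fs.s3a.path.style.access" -> "true"
    ).map { case (key, value) =>
      // Spark forwards any property prefixed with "spark.hadoop." to the Hadoop configuration.
      s"spark.hadoop.$key" -> value
    }
}
```

In a Spark application, the resulting map could be applied to a `SparkConf` with `S3ASettings.forEndpoint("https://swiftstack.example.com", accessKey, secretKey).foreach { case (k, v) => conf.set(k, v) }`, after which `s3a://` paths resolve against that endpoint.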

SwiftStack and Alluxio integration for broader analytics use cases

Though SwiftStack has been supporting Spark deployments using the S3A connector, the Alluxio and SwiftStack solution becomes paramount when:
a) the use case is metadata heavy, or
b) the customer is cloud bursting, with Spark and compute infrastructure running in the cloud, or
c) support for a broad set of analytics APIs is needed.

This solution is built for multi-cloud data lakes, using popular frameworks and applications like Hadoop, Spark, Presto, TensorFlow and Hive, and delivers ten times the performance with a cost-effective workflow spanning on-premises and cloud resources. The solution enables users to build a high-performance big data analytics or AI/ML data pipeline at scale in a “memory-first” architecture. Storage and compute are decoupled and can scale on demand, to billions of files and hundreds of petabytes, as data loads and performance needs grow.

The SwiftStack solution is powered by SwiftStack’s object cloud storage with 1space multi-cloud data management and the Alluxio data orchestration layer, which sits between compute frameworks and storage. The solution eliminates common challenges with analytics applications today, including a lack of enterprise-ready multi-cloud workflows, insufficient throughput, and insufficient API compatibility for Spark RDDs (Resilient Distributed Datasets), DataFrames, Presto and other modern analytics frameworks.

There are three use cases where the solution can be deployed:

  1. HDFS off-load to SwiftStack, or co-existing with HDFS to extend performance and scale
  2. Cloud bursting with compute in multi-cloud (AWS, GCP etc.) with Alluxio
  3. Cloud native applications based on S3 APIs

Integrating the SwiftStack analytics solution into your stack

Mounting SwiftStack

$ ./bin/alluxio fs mount \
    --option swiftstack.accessKeyId=<SWIFTSTACK_ACCESS_KEY_ID> \
    --option swiftstack.secretKey=<SWIFTSTACK_SECRET_KEY_ID> \
    alluxio://master:port/s3 s3a://<S3_BUCKET>/<S3_DIRECTORY>
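To make the effect of the mount concrete: once the bucket directory is mounted at /s3, a path under that mount point corresponds to an object under the mounted S3 location. The sketch below mimics that correspondence with plain string handling. It is an illustration only, not Alluxio's implementation, and the `MountMapping` object and `toUnderStore` function are hypothetical names.

```scala
// Illustration only: mimics how a path under an Alluxio mount point
// corresponds to a path in the mounted under-store. Not Alluxio's own code.
object MountMapping {
  // Returns the under-store URI for an Alluxio path,
  // or None when the path lies outside the mount point.
  def toUnderStore(mountPoint: String, underStoreRoot: String, alluxioPath: String): Option[String] = {
    val mount = mountPoint.stripSuffix("/")
    val root = underStoreRoot.stripSuffix("/")
    if (alluxioPath == mount) Some(root)
    else if (alluxioPath.startsWith(mount + "/")) Some(root + alluxioPath.drop(mount.length))
    else None
  }
}
```

For example, with a mount of `/s3` onto `s3a://bucket/dir`, the path `/s3/logs/a.txt` would map to `s3a://bucket/dir/logs/a.txt`, while a path outside `/s3` has no counterpart in that bucket.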

Reading and writing from Spark using the RDD and DataFrame APIs

// Using Alluxio as input and output for an RDD

scala> val rdd = sc.textFile("alluxio://master:19998/Input")

scala> rdd.saveAsTextFile("alluxio://master:19998/Output")

// Using Alluxio as input and output for a DataFrame

scala> val df = sqlContext.read.parquet("alluxio://master:19998/Input.parquet")

scala> df.write.parquet("alluxio://master:19998/Output.parquet")

 

Summary

Customers increasingly want to derive actionable intelligence from their stored data sets, and the SwiftStack Data Analytics solution provides an effective way to meet this requirement. Customers looking to modernize their existing HDFS data stores, or to build an analytics solution on top of their existing SwiftStack deployments, should take a serious look at this solution. The SwiftStack and Alluxio joint webinar is a good way to learn more, and contacting us at info@swiftstack will get you started. The Alluxio Slack channel is available at alluxio.org/slack

About the Author

Shailesh Manjrekar

Shailesh has deep experience in infrastructure across storage and networking (EMC, NetApp, HGST, Brocade). As a thought leader on how AI/ML/DL impacts infrastructure, Shailesh has worked on the changes needed in the datacenter, at the edge and in the cloud. Shailesh serves as the Head of Product and Solutions Marketing for SwiftStack.