Architecting large-scale Artificial Intelligence/Deep Learning data pipelines
Businesses adopting AI/DL data pipelines for computer vision, natural language processing and understanding (NLP/NLU), and anomaly detection often ask me how best to architect these environments. To answer this, let’s analyze these use cases and their requirements.
Deep learning applications are inherently different from traditional applications
The accelerated compute needs of DL stem from deep neural networks with several hidden layers, such as convolutional neural networks (CNNs) and long short-term memory networks (LSTMs), which need a massively parallel compute layer. This compute layer is typically implemented using GPUs for training and inferencing.
- GPU servers for training include Tesla V100-based NVIDIA DGX-1 and DGX-2 servers or DGX workstations; Cisco UCS C480 ML; Dell C4140; or HPE Apollo
- GPU servers for inferencing can be T4-based servers from any of the major server vendors.
Each Tesla V100 has 5,120 CUDA cores and 640 Tensor Cores (units optimized for mixed-precision matrix math), whereas a T4 has 2,560 CUDA cores and 320 Turing Tensor Cores. Each of these servers has several of these GPUs. This is in stark contrast to CPU-based compute, where each CPU typically has only 24-28 cores.
Additionally, each of these servers comes with a large memory configuration (0.5 to 1 TB) and several TBs of NVMe Flash drives (16-32 TB and higher).
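To make the core-count contrast concrete, here is a rough back-of-the-envelope sketch. The per-GPU figures are the published V100 and T4 specs; the 8-GPU server and the dual-socket, 28-core CPU server are assumptions chosen for illustration, and the helper name `server_cores` is hypothetical.

```python
# Per-GPU core counts from the published V100 and T4 specifications.
CUDA_CORES = {"V100": 5120, "T4": 2560}
TENSOR_CORES = {"V100": 640, "T4": 320}


def server_cores(gpu: str, gpus_per_server: int) -> dict:
    """Total CUDA and Tensor cores in one server built from identical GPUs."""
    return {
        "cuda": CUDA_CORES[gpu] * gpus_per_server,
        "tensor": TENSOR_CORES[gpu] * gpus_per_server,
    }


# Assumption: an 8-GPU V100 training server (DGX-1 class) compared with
# a dual-socket CPU server that has 28 cores per socket.
gpu_server = server_cores("V100", 8)
cpu_server_cores = 2 * 28

# How many times more CUDA cores than CPU cores, rounded down.
ratio = gpu_server["cuda"] // cpu_server_cores

print(gpu_server, cpu_server_cores, ratio)
```

A CUDA core is much simpler than a CPU core, so the ratio is not a speedup figure. It only shows how much more parallelism the storage layer has to feed.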
Creating the right datasets is a big part of the AI/DL pipeline. This includes techniques like Bayesian approximation (Active Learning) and the use of open-source frameworks like XGBoost with RAPIDS (which is built on Apache Arrow) for indexing, querying, ETL processing (data processing), model training, and visualization.
With RAPIDS, the entire pipeline can run very effectively in GPU memory. Consequently, data only needs to be read once by the pipeline and written to persistent storage when done. RAPIDS and the GPUs therefore expect the storage layer to deliver enough parallelism and throughput to keep GPU memory full and the GPUs busy.
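What "keeping the GPUs busy" means for storage can be estimated with simple arithmetic. The sketch below is not a RAPIDS API; the function names and every input figure (GPU count, samples per second, sample size, dataset size) are hypothetical values picked only to show the calculation.

```python
def required_read_throughput(num_gpus: int,
                             samples_per_sec_per_gpu: float,
                             avg_sample_bytes: float) -> float:
    """Aggregate read throughput in GB/s needed so that no GPU waits on data."""
    return num_gpus * samples_per_sec_per_gpu * avg_sample_bytes / 1e9


def single_pass_seconds(dataset_bytes: float, throughput_gb_per_s: float) -> float:
    """Time to read the whole dataset once at the given throughput."""
    return dataset_bytes / (throughput_gb_per_s * 1e9)


# Hypothetical workload: 8 GPUs, each consuming 1,000 images per second,
# with an average image size of 150 KB.
needed = required_read_throughput(8, 1000, 150_000)

# Hypothetical 10 TB dataset read exactly once at that rate.
one_pass = single_pass_seconds(10e12, needed)

print(f"storage must sustain about {needed:.1f} GB/s")
print(f"one full pass takes about {one_pass / 3600:.1f} hours")
```

Because the pipeline reads the data only once, the sustained figure matters more than peak burst rate: if storage delivers less than `needed`, the GPUs idle for the difference.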
Deep Learning workflows are inherently different from traditional applications
Deep Learning workflows have varied storage I/O requirements at different stages, namely:
- Ingest from edge – write bandwidth and concurrency to handle multiple edge devices
- Reading, indexing and labeling of the data sets, and writing them back
- Massive read bandwidth for neural net training
- Inferencing, where the neural net makes predictions against newly ingested data sets
- Lifecycle management of these data sets used for training and inferencing
Again, this is in stark contrast to the contention-based (Read-Modify-Write) workloads or hierarchical namespace requirements typically seen when using distributed filesystems.
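The stages listed above can be written down as a small table of I/O profiles, which is sometimes a useful starting point when sizing each tier. This is only an illustrative sketch: the `StageProfile` type and the stage names are hypothetical, and the read/write flags for inferencing and lifecycle management are my assumptions, since the list does not spell them out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageProfile:
    stage: str
    reads: bool
    writes: bool
    note: str


PIPELINE = [
    StageProfile("ingest", reads=False, writes=True,
                 note="write bandwidth and concurrency for many edge devices"),
    StageProfile("prepare", reads=True, writes=True,
                 note="read, index and label data sets, then write them back"),
    StageProfile("train", reads=True, writes=False,
                 note="massive read bandwidth"),
    # Assumption: inferencing is treated as read-dominated here.
    StageProfile("infer", reads=True, writes=False,
                 note="predictions against newly ingested data sets"),
    # Assumption: lifecycle management both reads and rewrites/moves data.
    StageProfile("lifecycle", reads=True, writes=True,
                 note="manage data sets used for training and inferencing"),
]

# Stages that only read are candidates for a read-optimized cache tier.
read_only = [p.stage for p in PIPELINE if p.reads and not p.writes]
write_only = [p.stage for p in PIPELINE if p.writes and not p.reads]

print("read-only stages:", read_only)
print("write-only stages:", write_only)
```

None of the stages is modeled as an in-place update, which is the point of the contrast with read-modify-write workloads.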
These workflows need integration with workflow engines and orchestration layers like Kubeflow, TFX (TensorFlow Extended), Valohai, NGC (NVIDIA GPU Cloud, based on open-source Argo) and Amazon SageMaker, as well as with Kubernetes distributions (CNCF open source, Docker, Heptio, or Red Hat OpenShift). This enables an agile, performant end-to-end workflow of ingest, batch feature extraction, hyperparameter optimization, inferencing and versioning from a single pane of glass.
Lastly, these workflows need scale-out file APIs on top of an object storage backend. Several of these workflows support native S3 APIs, reading directly from what is typically known as the outer ring. Even frameworks like TensorFlow support S3 APIs. However, some workflows need scale-out file APIs (like POSIX or NFS), typically clustered to create a distributed cache layer using the NVMe Flash drives in the GPU servers (typically known as the inner ring). This needs to be provided by a log-structured file system that can support multiple readers and writers, with a CSI plugin to provide persistent storage for Kubernetes pods, in an object-friendly way, much like what SwiftStack offers with ProxyFS.
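The inner ring / outer ring split is essentially a read-through cache. The sketch below shows the idea on a single node using only the Python standard library. It is not ProxyFS or any SwiftStack API: the `ReadThroughCache` class, the `fetch_from_outer_ring` function, and the in-memory `OUTER_RING` dictionary are all hypothetical stand-ins (the dictionary plays the role of an S3 GET against the object store, and a temporary directory plays the role of local NVMe).

```python
import os
import tempfile
from typing import Callable


class ReadThroughCache:
    """Serve objects from a local directory, fetching from a backend on a miss."""

    def __init__(self, cache_dir: str, fetch: Callable[[str], bytes]):
        self.cache_dir = cache_dir
        self.fetch = fetch
        self.hits = 0
        self.misses = 0
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, key: str) -> str:
        # Flatten the object key into one file name (keys that already
        # contain "__" could collide; acceptable for a sketch).
        return os.path.join(self.cache_dir, key.replace("/", "__"))

    def read(self, key: str) -> bytes:
        path = self._path(key)
        if os.path.exists(path):
            self.hits += 1
            with open(path, "rb") as f:
                return f.read()
        self.misses += 1
        data = self.fetch(key)
        # Write to a temporary name, then rename, so a reader never
        # opens a partially written cache file.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, path)
        return data


# Hypothetical stand-in for the object store (outer ring).
OUTER_RING = {"train/img-0001.jpg": b"\xff\xd8fake-jpeg-bytes"}


def fetch_from_outer_ring(key: str) -> bytes:
    return OUTER_RING[key]


cache = ReadThroughCache(tempfile.mkdtemp(), fetch_from_outer_ring)
first = cache.read("train/img-0001.jpg")   # miss: fetched from the outer ring
second = cache.read("train/img-0001.jpg")  # hit: served from local flash
print(cache.misses, cache.hits)
```

A real inner ring also has to handle eviction, multiple writers, and coherence across GPU servers, which is why the text calls for a clustered, log-structured file system rather than a per-node cache like this one.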
AI/DL applications and workflows are inherently different from traditional file-based applications. The storage I/O requirements are different, and storage needs to be designed, architected, and deployed differently to meet these needs.
DRAM and Flash storage are needed primarily in the compute layer for RAPIDS-like frameworks and distributed caching, while the storage layer is expected to provide massive parallelism and throughput to keep the compute saturated. Data engineers architecting these pipelines should pay close attention to these characteristics. In part 2 of this series, we will discuss how the current solutions fall short and how SwiftStack is ideally suited to meet these requirements.