When I was looking for code examples of loading data into TensorFlow via the S3 API, the results were pretty sparse (with lots of questions about various error messages). I thought to myself, “someone should really write a tutorial on using S3 data in TensorFlow!” Then I realized that someone was me, and that is the origin story of this article!
What we’re going to do is load some data from an S3 API data source, preprocess it into a format usable for training, and review the data we have loaded. We’ll use everyone’s favorite dataset, MNIST, for speed and simplicity. We’ll show you two different methods for leveraging S3 data for AI/ML training:
- TensorFlow’s built-in S3 support, which allows file-like access to S3 API resources
- Boto3, a Python library allowing more granular manipulation of S3 resources
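To make the two options concrete, here is a minimal sketch (this is not the notebook’s code; the endpoint URL, credentials, bucket, and object names are placeholders). TensorFlow builds with S3 filesystem support expose objects as ordinary file handles through `tf.io.gfile`, while boto3 hands you an explicit client:

```python
def read_object_to_memory(s3_client, bucket, key):
    """Fetch an entire object into memory using an S3 client's get_object call."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()

# Hypothetical boto3 wiring -- endpoint, credentials, and names are placeholders:
# import boto3
# s3 = boto3.client(
#     "s3",
#     endpoint_url="http://127.0.0.1:8080",   # e.g. a local test endpoint
#     aws_access_key_id="ACCESS_KEY",
#     aws_secret_access_key="SECRET_KEY",
# )
# raw = read_object_to_memory(s3, "mnist", "train-images-idx3-ubyte.gz")
#
# The TensorFlow route reads the same object through a file-like handle
# (older TensorFlow builds include S3 support; newer ones need tensorflow-io):
# import tensorflow as tf
# with tf.io.gfile.GFile("s3://mnist/train-images-idx3-ubyte.gz", "rb") as f:
#     raw = f.read()
```

Passing the client in as an argument keeps the helper testable against anything that speaks the same `get_object` interface.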
Between these two approaches, you will learn how to load data directly into memory for training, which is faster for single-use datasets, as well as how to stage the data locally, which is useful when running multiple training sessions, for example during hyperparameter tuning.
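The local-staging pattern can be sketched as follows (again a sketch with a hypothetical client and paths, not the notebook’s code): download once, then reuse the local copy across training sessions:

```python
import os

def stage_object_locally(s3_client, bucket, key, local_path):
    """Download an object to local disk once and reuse it across training runs."""
    if not os.path.exists(local_path):
        # Only hit the object store on the first run; subsequent sessions
        # (e.g. a hyperparameter sweep) read the local copy instead.
        s3_client.download_file(bucket, key, local_path)
    return local_path
```

The trade-off is exactly the one described above: an in-memory read avoids the disk round-trip, while local staging amortizes the transfer cost over many training runs.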
We’re using Google Colab (https://colab.research.google.com/) because Colab is great, but you can also download the notebook and run it in Jupyter on your own system if you prefer.
If you want to set up your own Object Storage bucket (i.e. container) for testing, you can use our Docker container here: https://hub.docker.com/r/swiftstack/picoswiftstack
Pico SwiftStack provides an S3-compatible object storage platform in a Docker container for testing purposes.
We also provide a nice GUI client that works with S3-compatible storage here: https://www.swiftstack.com/downloads
Otherwise, we provide a hosted dataset that will let you run through the notebook without modification.
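The MNIST files themselves use the simple idx binary layout: a four-byte magic header (two zero bytes, a data-type byte, and a dimension count), big-endian dimension sizes, then the raw pixel bytes. Once you have the bytes by either method, a standard-library-only sketch of the decoding step might look like this (a hypothetical helper, and it assumes the files are already uncompressed; gzipped copies need `gzip.decompress` first):

```python
import struct

def parse_idx(raw):
    """Decode MNIST's idx binary format into (shape, flat data bytes)."""
    zero, dtype, ndim = struct.unpack_from(">HBB", raw, 0)
    assert zero == 0 and dtype == 0x08  # 0x08 marks unsigned-byte data
    shape = struct.unpack_from(">" + "I" * ndim, raw, 4)
    offset = 4 + 4 * ndim  # header plus one 32-bit size per dimension
    return shape, raw[offset:]
```

In practice you would then reshape the flat bytes into an array of shape `(n_images, 28, 28)` before feeding them to a model.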
I’m not going to go in depth in this post on what Object Storage is, or why it’s the ideal storage medium for petabyte-scale AI/ML workloads. Suffice it to say that object storage eschews a lot of the baggage of traditional storage platforms designed to be consumed by an operating system: the directory-tree structure and its file-system metadata overhead, the limited parallelization required to support block-level access, locking and file-integrity issues, and so on. By instead designing the platform to be consumed by applications via an API, you can achieve a degree of scalability and concurrency that is difficult, if not impossible, on traditional storage platforms.
This becomes relevant when you start dealing with large training datasets on the order of a petabyte or more, which we are beginning to see in several verticals now, e.g. medical imaging, oil and gas, and especially autonomous driving. Predictably, many large-scale AI workloads are moving to object storage after hitting scalability and performance limitations (or economic concerns) on traditional platforms. A great example of this is NVIDIA’s MagLev, an end-to-end autonomous vehicle platform.
High-level overview: https://developer.nvidia.com/gtc/2019/video/S9649
Engineering overview: https://developer.nvidia.com/gtc/2019/video/S9787
I’ve tried to be thorough in explaining what we are doing every step of the way: I’ve included code comments and links to documentation and/or original code snippets where appropriate, and I hope this will give you everything you need to understand what we are doing and why we are doing it. If anything is unclear, if you have any questions or feedback (including questions about Object Storage for AI/ML workloads), or if you just want to say hello, feel free to email me at firstname.lastname@example.org or contact me on Twitter at @j0nkelly.
Try it yourself on Google Colab (click ‘Open in Colab’ at the top): https://gist.github.com/jonkelly/fb2f2b45553ed541238e23e8509af69c
You’ll also find code snippets there, and our next update will include a training/validation walkthrough as well!