Distributed Machine Learning


For my class project in CSCE678: Distributed Systems and Cloud Computing, we performed distributed training with TensorFlow and Horovod on Amazon SageMaker to classify images from a fruit dataset on Kaggle. In this blog, I highlight the overall architecture, the implementation steps, and the test results.

 

Overall Architecture

The client converts the image to base64 and posts a request to the API Gateway. This triggers a Lambda function, which converts the base64 payload back into a PNG image, loads the trained model from S3, and predicts the label inside the Lambda function. The predicted label is then returned to the client in the response.

[Figure: ML architecture design]
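
To make this flow concrete, below is a minimal sketch of what such a Lambda handler could look like. The bucket name, model key, input size, and label list are illustrative assumptions rather than the project's actual values.

```python
import base64
import io
import json

import boto3
import numpy as np
from PIL import Image
from tensorflow import keras

s3 = boto3.client("s3")

# Illustrative names; the real bucket, key, and labels are project-specific.
MODEL_BUCKET = "my-fruit-model-bucket"
MODEL_KEY = "model/fruit_classifier.h5"
LABELS = ["apple", "guava", "kiwi"]

# Download and load the trained model once per Lambda container (cold start).
s3.download_file(MODEL_BUCKET, MODEL_KEY, "/tmp/model.h5")
model = keras.models.load_model("/tmp/model.h5")


def handler(event, context):
    # API Gateway delivers the base64-encoded image in the request body.
    body = json.loads(event["body"])
    image_bytes = base64.b64decode(body["image"])

    # Decode to an RGB image and resize to the input shape the model expects
    # (assumed 100x100 here, matching Fruits 360 images).
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB").resize((100, 100))
    batch = np.expand_dims(np.asarray(image) / 255.0, axis=0)

    # Predict and map the highest-probability class index back to a label.
    probs = model.predict(batch)[0]
    label = LABELS[int(np.argmax(probs))]

    return {"statusCode": 200, "body": json.dumps({"label": label})}
```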

 

Implementation Steps

Our implementation process broadly consisted of five major steps.

1. Modifying our training script to be distribution-ready using the Horovod library

We used the Horovod library to make our training script scale to training across many GPUs in parallel, i.e. the script runs unchanged whether it is given one GPU or many.
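
The typical Horovod changes are sketched below. The model architecture, learning rate, and epoch count are placeholders, and train_data stands in for the input pipeline described in step 2; this is a sketch of the usual pattern, not the project's exact script.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod and pin each worker process to a single local GPU.
hvd.init()
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder CNN; the project's actual model architecture may differ.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(100, 100, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# Scale the learning rate by the number of workers and wrap the optimizer so
# that gradients are averaged across all GPUs on every step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(learning_rate=0.001 * hvd.size()))
model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    # Start every worker from identical weights broadcast by rank 0.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# Only rank 0 writes checkpoints so concurrent workers do not clobber the file.
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint("/opt/ml/model/checkpoint.h5"))

# train_data stands in for the input pipeline described in step 2.
model.fit(train_data, epochs=10, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```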

2. Dataset 

We used a subset of the Fruits 360 dataset from Kaggle, focusing on images of apples, kiwis, and guavas.

The dataset consisted of 9,600 training images, 1,200 validation images, and 1,200 test images.
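
As an illustration, such a subset can be loaded with Keras' ImageDataGenerator, assuming the images are organized into one folder per class; the directory names and batch size below are assumptions (Fruits 360 images are 100x100 pixels).

```python
import tensorflow as tf

# Illustrative layout: one sub-folder per class (apple/, guava/, kiwi/).
train_dir = "data/train"       # 9,600 images
val_dir = "data/validation"    # 1,200 images

# Rescale pixel values to [0, 1].
datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1.0 / 255)

train_data = datagen.flow_from_directory(
    train_dir, target_size=(100, 100), batch_size=32, class_mode="categorical")
val_data = datagen.flow_from_directory(
    val_dir, target_size=(100, 100), batch_size=32, class_mode="categorical")
```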

3. Distributed Training using Amazon SageMaker

We used Amazon SageMaker to run the training job at scale, with the ability to scale the number of GPUs up and down as desired.

We launched an Amazon SageMaker notebook instance and defined the estimator with the training script, the S3 location to save trained models, the GPU instance type and count, the TensorFlow version, and the MPI distribution type. We then specified the paths to the training, validation, and test datasets in Amazon S3 and passed them to the estimator's fit() function.
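
In the SageMaker Python SDK, this step roughly corresponds to the sketch below; the script name, instance type and count, framework version, output path, and S3 channel paths are illustrative assumptions.

```python
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()

estimator = TensorFlow(
    entry_point="train_horovod.py",            # the Horovod-enabled training script
    role=role,
    instance_count=2,                          # number of training instances
    instance_type="ml.p3.2xlarge",             # GPU instance type
    framework_version="2.1",                   # TensorFlow version
    py_version="py3",
    output_path="s3://my-bucket/fruit-model/", # where trained artifacts are saved
    distribution={"mpi": {
        "enabled": True,
        "processes_per_host": 1,               # one MPI process per GPU on the host
    }},
)

# Each channel maps to an S3 prefix and is mounted in the training container
# under /opt/ml/input/data/<channel_name>.
estimator.fit({
    "train": "s3://my-bucket/fruits/train",
    "validation": "s3://my-bucket/fruits/validation",
    "test": "s3://my-bucket/fruits/test",
})
```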

4. Monitoring 

We monitored training progress through Amazon CloudWatch, which gave us snapshots of the CPU utilization, memory utilization, and disk utilization of the training instances.
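
The same host metrics can also be pulled programmatically with boto3, as sketched below; the training job name here is hypothetical, and SageMaker publishes metrics such as CPUUtilization, MemoryUtilization, and DiskUtilization under the /aws/sagemaker/TrainingJobs namespace.

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Query average CPU utilization for one training host over the last hour.
response = cloudwatch.get_metric_statistics(
    Namespace="/aws/sagemaker/TrainingJobs",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "Host", "Value": "my-training-job/algo-1"}],  # hypothetical job name
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), "%")
```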

5. User Interface 

We designed a web application that lets users interact with the machine learning model, built with React and Angular.js and deployed on AWS Elastic Beanstalk.
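
Under the hood, the front end simply posts the base64-encoded image to the API Gateway endpoint; the Python sketch below shows an equivalent request, with a hypothetical URL and payload field name.

```python
import base64
import json

import requests

# Hypothetical endpoint; the real API Gateway URL is project-specific.
API_URL = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/classify"

# Base64-encode a local image, mirroring what the web client does in the browser.
with open("apple.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(API_URL, data=json.dumps({"image": encoded}),
                     headers={"Content-Type": "application/json"})
print(resp.json())  # e.g. {"label": "apple"}
```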

Results

Once training was complete, SageMaker automatically uploaded the training artifacts, such as the trained models, checkpoints, and TensorBoard logs, to our S3 bucket. We achieved a test accuracy of 90.72%.


 
