Get started with the latest Amazon SageMaker services (Data Wrangler, Data Pipeline, and Feature Store) released at re:Invent in December 2020. We also learn about SageMaker Ground Truth and how it can help us sort and label data.
Get a head start in machine learning by learning how these services can reduce the effort and time required to load and prepare data sets for analysis and modeling. Data scientists often spend 70% or more of their time cleaning, preparing, and wrangling data into a state where it’s suitable for training machine learning algorithms. It’s a lot of work, and these new SageMaker services provide an easier way.
The truth is out there, well, most of the time. In cases where it's not, use Amazon SageMaker Ground Truth. Let's understand why.
Labeled data is an essential ingredient for particular forms of machine learning, specifically supervised learning algorithms. During the training phase, the supervised learning algorithm measures the accuracy of the model by generating predictions and comparing them to a known label associated with the data. A typical example of this is image classification. When training an image classification model, labeled images are used, whereby each image carries one or many labels indicating what is contained within the image: for example, a person, car, dog, cat, et cetera.
MNIST, CIFAR-10, and ImageNet are all examples of publicly available datasets that have already been labeled and are often used for training. During the training phase, checks can be performed to see whether the classification predicted for an image matches its associated label. Iterations, or epochs, of training continue until the predictions reach a desired level of accuracy.
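The comparison of predictions against known labels can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular training framework does it; the function name `accuracy` and the sample labels are made up for the example.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Return the fraction of predictions that match their known labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one labeled example is required")
    matches = sum(
        1 for predicted, actual in zip(predictions, labels) if predicted == actual
    )
    return matches / len(labels)


# Hypothetical known labels and model outputs for four images.
known_labels = ["cat", "dog", "car", "person"]
model_predictions = ["cat", "dog", "dog", "person"]

# Three of the four predictions match their labels.
print(accuracy(model_predictions, known_labels))
```

Training would repeat this kind of check after each epoch and stop once the value reaches the desired level.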
To date, the process of labeling has been time consuming, with limited tooling to aid the job. To help expedite and improve the experience, Amazon SageMaker Ground Truth has been added to the SageMaker portfolio. Ground Truth is a labeling service which provides both automatic and human workforce labeling features. With Ground Truth, you simply upload your unlabeled data sets into an S3 bucket. Next, you create a manifest file with pointers to each of the images, and place the manifest file within the same S3 bucket.
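As a rough sketch of the manifest step: a Ground Truth input manifest is a JSON Lines file in which each line is a JSON object whose `source-ref` key points at one S3 object. The helper name `build_manifest`, the bucket, and the object keys below are hypothetical, and uploading the result to S3 is left out.

```python
import json
from typing import Iterable


def build_manifest(image_uris: Iterable[str]) -> str:
    """Return a JSON Lines manifest with one {"source-ref": uri} object per line."""
    return "".join(json.dumps({"source-ref": uri}) + "\n" for uri in image_uris)


# Hypothetical bucket and object keys, used only for illustration.
uris = [
    "s3://example-bucket/images/img-001.jpg",
    "s3://example-bucket/images/img-002.jpg",
]
manifest_text = build_manifest(uris)
print(manifest_text)
```

The resulting text would be saved as a file (for example `input.manifest`, an assumed name) and uploaded alongside the images.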
Using the Ground Truth console, create a Labeling Workforce. A Labeling Workforce represents the human workforce that performs the labeling itself. There are currently three options: Public, a team of global on-demand workers, powered by Amazon Mechanical Turk; Private, a team of workers from your organization; and Vendor, a selection of experienced vendors that specialize in providing data labeling services.
Finally, we are ready to create a labeling job. A labeling job represents the actual labeling exercise that you need performed. The key configuration settings are: Job Name; Input Dataset Location, the S3 bucket location of the manifest file; Output Dataset Location, an S3 bucket location to receive the labeling data; Dataset Object Selection, which allows you to label the entire dataset, a random sample, or a filtered selection of the data; Task Type, chosen from a list that includes Image Classification, Bounding Box, Text Classification, and Semantic Segmentation, or your own Custom Task Type; Workers, the human workforce required to perform the job; and the Bounding Box Labeling Tool, where you configure the UI labeling tool that the workers will use, including helper text in the form of instructions and guidance.
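The same settings can also be supplied programmatically. The sketch below only assembles a request dictionary; the field names follow the SageMaker `CreateLabelingJob` API as best understood here and should be checked against the current API reference before use. Every ARN, S3 URI, and the helper name `build_labeling_job_request` are placeholders invented for the example, and the fixed values for workers per object and time limit are arbitrary choices.

```python
from typing import Any, Dict


def build_labeling_job_request(
    job_name: str,
    manifest_uri: str,
    output_uri: str,
    role_arn: str,
    workteam_arn: str,
    label_categories_uri: str,
    ui_template_uri: str,
    pre_task_lambda_arn: str,
    consolidation_lambda_arn: str,
) -> Dict[str, Any]:
    """Assemble the keyword arguments for a labeling job request."""
    return {
        "LabelingJobName": job_name,
        # Key under which each label is written in the output records.
        "LabelAttributeName": job_name,
        "InputConfig": {
            "DataSource": {"S3DataSource": {"ManifestS3Uri": manifest_uri}}
        },
        "OutputConfig": {"S3OutputPath": output_uri},
        "RoleArn": role_arn,
        "LabelCategoryConfigS3Uri": label_categories_uri,
        "HumanTaskConfig": {
            "WorkteamArn": workteam_arn,
            "UiConfig": {"UiTemplateS3Uri": ui_template_uri},
            "PreHumanTaskLambdaArn": pre_task_lambda_arn,
            "TaskTitle": "Classify each image",
            "TaskDescription": "Pick the label that best describes the image.",
            "NumberOfHumanWorkersPerDataObject": 1,  # arbitrary for this sketch
            "TaskTimeLimitInSeconds": 300,  # arbitrary for this sketch
            "AnnotationConsolidationConfig": {
                "AnnotationConsolidationLambdaArn": consolidation_lambda_arn
            },
        },
    }


# All values below are placeholders, not real resources.
request = build_labeling_job_request(
    job_name="example-image-labels",
    manifest_uri="s3://example-bucket/input.manifest",
    output_uri="s3://example-bucket/output/",
    role_arn="arn:aws:iam::111122223333:role/ExampleGroundTruthRole",
    workteam_arn="arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/example",
    label_categories_uri="s3://example-bucket/labels.json",
    ui_template_uri="s3://example-bucket/template.liquid",
    pre_task_lambda_arn="arn:aws:lambda:us-east-1:111122223333:function:example-pre",
    consolidation_lambda_arn="arn:aws:lambda:us-east-1:111122223333:function:example-acs",
)
print(request["LabelingJobName"])
```

With real resources in place, a dictionary like this could be passed to `boto3.client("sagemaker").create_labeling_job(**request)`. In this API the task type is not a single field; as far as the author of this sketch understands, it is implied by the UI template and the two Lambda functions chosen.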
Okay, so now that the labeling job has been created, the chosen workforce will be invited to begin the process of labeling. Notifications are provided in the form of an email containing the URL to the Ground Truth labeling tool. If automatic labeling has been enabled for your job, Ground Truth will analyze the data and perform the labeling. Otherwise, the configured human workforce will use the Ground Truth tooling to perform the labeling activity.
When the labeling job has been completed, the job owner or requester can visualize each image with its assigned label within the Ground Truth console. Finally, each image label is serialized into an output manifest file, a labeled copy of the input manifest written to the output location, against the corresponding image.
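Reading the labels back might look like the following sketch. It assumes an image classification job whose labeled manifest is JSON Lines, with each record holding the image's `source-ref` plus a `<label attribute>-metadata` object containing a `class-name`. The label attribute name `example-image-labels`, the sample records, and the helper `read_labels` are all made up for illustration.

```python
import json
from typing import Iterator, Tuple


def read_labels(manifest_text: str, label_attribute: str) -> Iterator[Tuple[str, str]]:
    """Yield (image URI, class name) pairs from a labeled manifest."""
    for line in manifest_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        metadata = record.get(f"{label_attribute}-metadata", {})
        yield record["source-ref"], metadata.get("class-name", "")


# Hand-written sample records, not real Ground Truth output.
sample_manifest = "\n".join(
    [
        json.dumps(
            {
                "source-ref": "s3://example-bucket/images/img-001.jpg",
                "example-image-labels": 0,
                "example-image-labels-metadata": {"class-name": "cat"},
            }
        ),
        json.dumps(
            {
                "source-ref": "s3://example-bucket/images/img-002.jpg",
                "example-image-labels": 1,
                "example-image-labels-metadata": {"class-name": "dog"},
            }
        ),
    ]
)

labels = dict(read_labels(sample_manifest, "example-image-labels"))
print(labels)
```

Other task types, such as bounding box, store richer annotation structures, so this helper would need adapting for them.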
Lectures
Introduction to SageMaker Data Wrangler - Getting Started with Data Wrangler - Setting Up SageMaker to Run Data Wrangler - Using Data Wrangler - Service and Cost Review
Andrew is fanatical about helping business teams gain the maximum ROI possible from adopting, using, and optimizing Public Cloud Services. Having built 70+ Cloud Academy courses, Andrew has helped over 50,000 students master cloud computing by sharing the skills and experiences he gained during 20+ years leading digital teams in code and consulting. Before joining Cloud Academy, Andrew worked for AWS and for AWS technology partners Ooyala and Adobe.