
CSV Cleanup



When we saw how incredibly popular our blog post on Amazon Machine Learning was, we asked data and code guru James Counts to create this fantastic in-depth introduction to the principles and practice of Amazon Machine Learning so we could completely satisfy the demand for ML guidance within AWS.

If you've got a real-world need to apply predictive analysis to large data sources - for fraud detection or customer churn analysis, perhaps - then this course has everything you'll need to know to get you going.

James has got the subject completely covered:
  • What exactly machine learning can do
  • Why and when you should use it
  • Working with data sources
  • Manipulating data within Amazon Machine Learning to ensure a successful model
  • Working with machine learning models
  • Generating accurate predictions

In this demo, we will download some sample data, prepare it for use with Amazon ML, and then add it to S3. So I've started by opening Chrome and navigating to a website where we can download some sample data. This is the UCI Machine Learning Repository, and it's a great site to download free data sets. It's not the only site, but it'll suit our purposes well. So we'll start by clicking on the "View All Data Sets" link. And we'll scroll down over here to the business area. Click that. And next, we'll choose bank marketing. And we can see some basic data about this data set. It's multivariate.

The associated task is classification. And as we'll see, it's specifically a binary classification. There are 45,000-plus instances, or rows, in the file, and 17 attributes per row. And we can download this data by clicking the data folder link. And when we click the data folder link, we can see that there are two zip files to choose from. The one we want is the bank zip file, so I'll click that and download. And when the download completes, we can see it in the download folder. Go ahead and open it to extract the files.

And inside the folder, we've got three files. The file that we're interested in is the large file, bank-full.csv.

That's our data. So let's take a look at it in our text editor, just to get a feeling for what that data looks like. So I will just go ahead and click open. And it opens in the text editor.

And we can increase the size here. This file has a header row, and that's not a problem for Amazon ML. But we need to make sure that this data conforms to the data format accepted by Amazon Machine Learning. So let's visit those requirements in our browser and check our file against them. I've already got the tab open here. It's part of the machine learning developer guide on the Amazon website. This gives us point-by-point details about the data format required by Amazon. So first of all, it has to be plain text using ASCII, Unicode, or EBCDIC. And I think we're fine there with our file. The observations are one observation per line, and again, as we saw when the file was open, that requirement's being met. And each observation is divided into attribute values separated by a comma delimiter.

Now, here we see our first potential issue with the file. Amazon requires that the file is comma-delimited, and as we can see in our file, it is actually delimited by semicolons. So we'll definitely have to do at least that level of cleanup in order to use this file with Amazon. Let's go on.
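As a rough illustration of the delimiter check described above, here is a minimal sketch that counts candidate delimiters in a header line. The function name `guess_delimiter` and the list of candidates are assumptions made for this example, not part of the demo, and a simple count like this can be fooled by delimiters that appear inside quoted fields.

```python
def guess_delimiter(header_line, candidates=(",", ";", "\t", "|")):
    """Return the candidate character that occurs most often in the line.

    Hypothetical helper: a plain character count, not a full CSV parse.
    """
    counts = {c: header_line.count(c) for c in candidates}
    return max(counts, key=counts.get)


# A header shaped like the one in the downloaded file (shortened here).
print(guess_delimiter('"age";"job";"marital";"y"'))
```

If the result is anything other than a comma, the file needs the kind of conversion shown later in the demo.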

Each observation must be terminated by an end of line character. And we're fine there. And the attribute values themselves cannot include end of line characters.

Now, this is important, because some data might have line breaks within a field, and that's not allowed with Amazon. So I don't think we have that issue with our file, but if we did, then that would be something else that we would need to clean up by replacing line endings with some other character, or just removing them entirely.

Every observation must have the same number of attributes, in the same sequence. So this just means that our file needs to have the same number of columns on each line, if you think of it as a table. The data could potentially be null, but the columns need to line up. And the CSV file has to have a consistent schema from top to bottom, every row representing the same data in the same place.

And finally, each observation itself must be no larger than 10 MB. It's important to remember that when they're talking about observations, they're referring to individual rows. They're not talking about the total size of the file. So one row cannot be larger than 10 MB, but the whole file can easily be much larger than that. And make sure that you take note of this last part, that Amazon ML will reject rows that exceed the limit, and if too many of them are rejected, then the whole file gets dropped and we can't use it. So the important thing is to know what these rules are and clean up your file appropriately before submitting it to Amazon ML.

And there's actually one more requirement that is not strictly a requirement of the data format, because after some cleanup, this file would comply with those requirements. It's a requirement of the type of classification that we might use this file for, which is binary classification. And binary classification requires that the target value, meaning the value that we're looking for, is described in terms of zeroes and ones. In this case, the y column, the last value in the row, holds the label for the target value.
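The format rules walked through above can be checked before uploading. The following is a minimal sketch, not something shown in the demo: `find_format_problems` and `MAX_ROW_BYTES` are hypothetical names, and the size limit is simply the figure quoted in the demo, so check it against the current developer guide.

```python
import csv
import io

# Per-observation limit as quoted in the demo (an assumption; verify it).
MAX_ROW_BYTES = 10 * 1024 * 1024


def find_format_problems(text, delimiter=",", max_row_bytes=MAX_ROW_BYTES):
    """Return (row_number, message) tuples for rows that break the rules."""
    problems = []
    # newline="" keeps line breaks inside quoted fields intact for csv.reader.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    expected = None
    for number, row in enumerate(reader, start=1):
        if expected is None:
            expected = len(row)  # the first row sets the column count
        if len(row) != expected:
            problems.append(
                (number, f"expected {expected} columns, found {len(row)}")
            )
        if any("\n" in field or "\r" in field for field in row):
            problems.append((number, "line break inside a field"))
        if len(delimiter.join(row).encode("utf-8")) > max_row_bytes:
            problems.append((number, "observation exceeds size limit"))
    return problems


# Row 2 has an extra column; row 3 has a line break inside a quoted field.
for problem in find_format_problems('a,b\n1,2,3\n"x\ny",2\n'):
    print(problem)
```

The row numbers count parsed records rather than physical lines, so a record that spans two lines still counts once.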

And as we can see in the first observation, that value is actually a no. And so what Amazon ML would expect is that this value is a zero. And since it says no or yes, that's something we're going to have to clean up within our data preparation step in order to produce a file that'll be useful to us as part of Amazon ML.

So let's clean up this file. So I've created a new PyCharm project called "banking" and it's currently empty. The first thing we'll want to do is bring our data over into the project so that we have access to that source data. So I've got the project folder here in the Finder, and I'll open a new Finder window, and we'll visit our downloads. And in our download folder, there's the bank-full.csv file that we just looked at, so I'll just copy that here into the banking folder. And I'll actually create a new folder called "data" and I'll put the file in there. So once we've done that, we'll go ahead and create our first Python file, and we'll just call it "cleanup.py." And once we've got that first file, we can see our data over here, as well. And the first thing we'll need to do is import the csv module. It allows us to work with delimited files of any kind, because we can specify the delimiter. Next, we'll go ahead and define the data path as a variable, and that'll be data/bank-full.csv.

Next, we'll open a file handle for that file and call it content. And on the following line, we'll create a CSV reader, give it that open file handle, and we'll indicate that, in the input file, our delimiter is a semicolon and that our quote character is a double quote. That's a little bit hard to read, but what it actually is is a double quote inside single quotes. So that reflects the data that we actually have. And with this information, the CSV reader should be able to read out that data correctly. And to read all of that data into our program, we'll use a Python list comprehension here. We're saying r for r in data. That means we take every row in the data and place it into this new list called rows. So now that we have our data read into our program, our next step will be to write it out into the clean file. So we indicate a new file in the data folder, cleanup.csv, which we'll create by opening that file for writing, and we'll just call it content again. That name's not super important.
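The read step described above might look roughly like the sketch below. The exact code typed in the demo is not reproduced in the transcript, so the function name `read_rows` is an assumption, and an in-memory sample stands in for `data/bank-full.csv` so that the example runs on its own.

```python
import csv
import io


def read_rows(handle):
    """Read semicolon-delimited, double-quoted data into a list of rows."""
    data = csv.reader(handle, delimiter=";", quotechar='"')
    # A list comprehension, so the rows can be indexed and sliced later.
    return [r for r in data]


# In-memory stand-in for the real file (shortened to three columns).
sample = io.StringIO('"age";"job";"y"\n58;"management";"no"\n')
rows = read_rows(sample)
print(rows)
```

With the real file, the handle would come from something like `open("data/bank-full.csv", newline="")` instead of `io.StringIO`.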

This time, our data handle will be csv.writer instead of csv.reader. We'll give it that open output handle and we'll tell it that we want the minimal amount of quoting. This argument should be enough to let the CSV module know that we need to quote any field that has a comma in it, and that should get us compliant with the requirements from Amazon Machine Learning.

You'll notice that we did not specify a delimiter. We did not specify a quote character. We're leaving those defaults in place, which will give us a default delimiter of comma. Which, of course, is what we needed to do in order to prepare this file for use with Amazon ML. And the last part of our program, we go ahead and iterate over the rows that we read in, so we get each row, and then we just write it out to the output file. And csv.writer takes care of adding the commas and quoting for us. So let's go ahead and give this a try by clicking the Play button up here. You can see that in just a moment our process finishes, and that the new cleanup.csv is created for us in the data folder. So let's check out the contents. You can see that a lot of the quoting is gone, and that the semicolons have been changed to commas, so that's good, but that our yes/no values in the final field have not yet been changed to zeroes and ones. So that's something we're going to have to fix before we submit this file to Amazon.
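A minimal sketch of the write step described above, under the assumption that the demo passes `csv.QUOTE_MINIMAL` as the quoting argument and leaves the other settings at their defaults. The helper name `write_rows` and the sample rows are made up for illustration.

```python
import csv
import io


def write_rows(handle, rows):
    """Write rows as comma-delimited text, quoting only where needed."""
    out = csv.writer(handle, quoting=csv.QUOTE_MINIMAL)
    for r in rows:
        out.writerow(r)


# In-memory stand-in for data/cleanup.csv; one field contains a comma.
buffer = io.StringIO()
write_rows(buffer, [["age", "job", "y"], ["58", "admin, senior", "no"]])
print(buffer.getvalue())
```

Only the field that contains a comma ends up quoted. When writing to a real file, the csv module documentation recommends opening it with `newline=""` so the writer controls the line endings.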

So let's close that file for now. We can fix the final value in either our reading or our writing, and we'll just go ahead and do that in our writing. The first thing that we'll do is make sure that we write our header without changing anything. So we'll make a new line here, and we'll just add a line that writes a single row, row zero, where the headers are, without making any changes.

Now, for every row besides the header, we do want to make changes. And so we want to specify in this loop to not include the header, but to include every other row in the set. So we do that in Python with a slice. So we'll treat this collection as an array, and we might say something like one-colon-five to say that I want the rows from one up to, but not including, five.

But in reality, I want all the rows that come after the first row, so the first row is zero, and the first row that I want is one, and then after that, I want every row.

So I will remove the five and just leave the colon there, and that means I want one till the end of the rows. So give me everything from one onward. So now that we've specified the rows that we want to process, we need to tell Python what we want to do with those rows. So we'll make a new line in the for loop, and just to make it a little bit clearer, we'll go ahead and create a new variable called y, which was the name of the column. And y is the last item in the row. So r is our row, and we'll say negative one as our index, and that indicates that we want the last column. So it's like counting backwards.
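The slicing and negative indexing described above can be tried on a small list. The rows here are invented for illustration and are shortened to three columns.

```python
rows = [
    ["age", "job", "y"],             # row 0: the header
    ["58", "management", "no"],
    ["44", "technician", "yes"],
]

header = rows[0]                     # just the header row
body = rows[1:]                      # everything from row 1 to the end
subset = rows[1:2]                   # the end of a slice is exclusive
labels = [r[-1] for r in body]       # -1 picks the last column of each row

print(header, body, subset, labels, sep="\n")
```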

So if we're at zero, we go negative one. We wrap around to the end of the row, and we get the very last value. So now that we've got y, we need to update the last value. This type of expression in Python can be read as: we want the value to be one if y is equal to yes, otherwise we want the value to be zero. And once we've updated that final field, we can go ahead and write the row to our CSV file.

Let's try this out again by clicking the Play button. And we can see once again that the process completed very quickly. And we'll open up our cleanup file. And we can see that the final field has now been updated to zero in the cases where it used to say no. Our header has not been changed. And we also have ones in there in places where it used to say yes. So now that we have some clean data, let's visit our S3 bucket and upload it to Amazon. I'll go ahead and shut down PyCharm.
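Putting the steps of this part of the demo together, the cleanup could be sketched as below. This is a reconstruction from the narration, not the exact script on screen: the function name `clean` is an assumption, and in-memory buffers stand in for `data/bank-full.csv` and `data/cleanup.csv`.

```python
import csv
import io


def clean(source, target):
    """Convert semicolon-delimited rows to commas and map yes/no to 1/0."""
    rows = [r for r in csv.reader(source, delimiter=";", quotechar='"')]
    out = csv.writer(target, quoting=csv.QUOTE_MINIMAL)
    out.writerow(rows[0])                 # header row, written unchanged
    for r in rows[1:]:                    # every row after the header
        y = r[-1]                         # last column holds the label
        r[-1] = 1 if y == "yes" else 0    # conditional expression
        out.writerow(r)


# In-memory stand-ins for the input and output files.
source = io.StringIO(
    '"age";"job";"y"\n58;"management";"no"\n44;"technician";"yes"\n'
)
target = io.StringIO()
clean(source, target)
print(target.getvalue())
```

Note that any label other than "yes" becomes 0 in this sketch, including misspelled or empty values, so it may be worth checking the distinct values in the column first.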

And I've already logged into the S3 console and created an empty bucket for our demo. So this is ca-ml-demo, for Cloud Academy machine learning demo. So we can click Upload, and then open Finder, drag our file onto the correct part of the screen, and drop it. That queues our file for upload. Click Start Upload, and in a few moments, we'll see that our file uploads to our S3 bucket. And it's done.

So now we're done with this demo. We have some data that's ready to use with Amazon ML. Let's just review real quick. We found some publicly available data, but it had a few issues with it that we needed to clean up before we could get it into the required Amazon ML format. And once we did that with a simple Python script, we uploaded the data to S3 so that we could use it later. I'm Jim Counts. Thanks for watching this demo on preparing CSV data for Amazon Machine Learning.

About the Author
James Counts
Software Developer

James is most happy when creating or fixing code. He tries to learn more and stay up to date with recent industry developments.

James recently completed his Master’s Degree in Computer Science and enjoys attending or speaking at community events like CodeCamps or user groups.

He is also a regular contributor to the ApprovalTests.net open source projects, and is the author of the C++ and Perl ports of that library.