Basic Data Wrangler Operations

Learning Objectives

This course is an introductory-level AWS development course. You will learn about the AWS Data Wrangler library, what it does, and how to set it up for use.

Intended Audience

This course is intended for AWS Python developers familiar with the Pandas and PyArrow libraries who are building non-distributed pipelines using AWS services. The AWS Data Wrangler library provides an abstraction for connectivity, extract, and load operations on AWS services. 


To get the most out of this course, you must meet the AWS Developer Associate certification requirements or have equivalent experience.

This course assumes that you have an existing Python development environment and have set up the AWS CLI or SDK with the required configuration and keys. Familiarity with Python syntax is also a requirement. We walk through the basic setup for some of these but do not provide detailed explanations of the process.

For fundamentals and additional details about these skills, you can refer to the following courses here at Cloud Academy:  

1) Python for Beginners 

2) Data Wrangling With Pandas

3) Introduction to the AWS CLI 

4) How to Use the AWS Command-Line Interface



Data Wrangler Operations. AWS Data Wrangler supports more than a dozen AWS services, including fundamental services such as Amazon S3, RDS, and DynamoDB. The table shown gives you a sense of what is supported. Notice AWS Glue, Athena, Lake Formation, and Redshift. Notice also OpenSearch, DynamoDB, Timestream, EMR, and even CloudWatch Logs. Data Wrangler handles sessions and AWS credentials using a Boto3 Session. In general, you can use the default Boto3 session. You can customize the default session, for example by changing the region, before you use it. And finally, you can use a new custom Boto3 session if you want.

Data Wrangler is stateless, and developers need to manage the sessions. For this demo, we will use the default session. We have created a bucket called CA Data Wrangler, and into the bucket we uploaded three image files. Notice the response of True, confirming that we can connect and operate on the bucket via the API. Next, we list the objects in the bucket to get the list of the three images uploaded to the CA Data Wrangler bucket. Finally, we delete one of the files and list the bucket again for confirmation. This verifies that our Python script can communicate with Amazon S3 and operate on it using the API available via AWS Data Wrangler. Data Wrangler supports additional operations on Amazon S3, including copying a list of S3 objects to another S3 directory, deleting S3 objects, describing S3 objects, checking whether an object exists, downloading a file, and running a select query to filter the contents of an Amazon S3 object based on a SQL statement.

For Amazon S3, the supported file types for writing are CSV, JSON, Parquet, and Excel. For reading, you have CSV, JSON, Parquet, and Excel, and in addition fixed-width format, which is supported for reading only. If for any reason we need to manage passwords and secrets, Data Wrangler can interact with AWS Secrets Manager to obtain passwords along with other secret resources. As a quick test, we created a secret intended to represent a password and stored it in Secrets Manager. To retrieve the secret, we make a call similar to the one shown on the screen.

We run the corresponding code in the notebook to get the results. AWS Data Wrangler uses the AWS Glue Data Catalog to store metadata, tables, and connections, and you can create and operate on Glue catalog databases and tables. Data Wrangler also integrates with Amazon Athena, so with AWS Data Wrangler you can run SQL queries on Amazon S3 using Athena directly. This permits you to handle large data sets from within your Python code using the Data Wrangler library. With regard to Athena, you can create the default Athena bucket if it doesn't exist, get the data types of all columns in a query, fetch query execution details, and of course execute any SQL query on Amazon Athena and return the results as a Pandas DataFrame.

Athena is also integrated with the AWS Glue Data Catalog for a unified metadata repository across services. You can also access RDS instances running PostgreSQL, MySQL, or Microsoft SQL Server. The API is the same for all three engines. You can obtain a connection from a Glue Catalog connection or from Secrets Manager with the connect call. You can return a DataFrame corresponding to the result set of a query, return a DataFrame corresponding to a table, and write records stored in a DataFrame into the engine. This is just some of the general functionality available with AWS Data Wrangler to integrate Python and Pandas with AWS data services; there are over a dozen integrations included in the library. Data Wrangler simplifies the process of connecting and importing data into Pandas DataFrames, so that you can focus on the transformation part of the ETL process.


About the Author
Jorge Negrón
AWS Content Architect

Experienced in the architecture and delivery of cloud-based solutions, the development and delivery of technical training, and in defining requirements and use cases and validating architectures for results. Excellent leadership, communication, and presentation skills with attention to detail. Hands-on administration and development experience, with the ability to mentor and train in current and emerging technologies (Cloud, ML, IoT, Microservices, Big Data & Analytics).