Background Concepts for AWS Data Wrangler

Learning Objectives

This course is an introductory-level AWS development course. You will learn what the AWS Data Wrangler library does and how to set it up for use.

Intended Audience

This course is intended for AWS Python developers familiar with the Pandas and PyArrow libraries who are building non-distributed pipelines using AWS services. The AWS Data Wrangler library provides an abstraction for connectivity, extract, and load operations on AWS services. 


To get the most out of this course, you must meet the AWS Developer Associate certification requirements or have equivalent experience.

This course expects that you have an existing Python development environment and have set up the AWS CLI or SDK with the required configuration and keys. Familiarity with Python syntax is also required. We walk through the basic setup for some of these steps but do not provide detailed explanations of the process.

For fundamentals and additional details about these skills, you can refer to the following courses here at Cloud Academy:  

1) Python for Beginners 

2) Data Wrangling With Pandas

3) Introduction to the AWS CLI 

4) How to Use the AWS Command-Line Interface



Background Concepts for AWS Data Wrangler. Before we dive into the setup of AWS Data Wrangler, let's review some of the basic moving parts and their functions. The general idea of extract, transform, and load, or ETL, is a process that integrates data from multiple sources into a single, consistent data store. During the ETL process, data is extracted from sources, transformed, and, if needed, integrated with other data and stored somewhere else. The AWS Data Wrangler Python library uses Boto3, Pandas, and Apache Arrow.
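The ETL flow just described can be sketched with plain Pandas, independent of any AWS service. This is an illustrative example only; the in-memory CSV sources stand in for real data stores:

```python
import io
import pandas as pd

# Extract: read raw data from two sources
# (in-memory CSVs stand in for real files or tables).
orders_csv = io.StringIO("order_id,customer_id,amount\n1,10,25.0\n2,11,40.0\n")
customers_csv = io.StringIO("customer_id,region\n10,EU\n11,US\n")
orders = pd.read_csv(orders_csv)
customers = pd.read_csv(customers_csv)

# Transform: integrate the two sources, then aggregate amounts per region.
merged = orders.merge(customers, on="customer_id")
by_region = merged.groupby("region", as_index=False)["amount"].sum()

# Load: write the consolidated result to the target store
# (here, just a CSV string).
result_csv = by_region.to_csv(index=False)
```

The same three steps reappear later in the course, with AWS services taking the place of the local extract and load targets.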

Boto3 is the name of AWS's Python software development kit (SDK). It allows you to use AWS services from your Python code. Pandas is a library of data structures and operations for manipulating data and performing analysis. A Pandas DataFrame is a commonly used data structure that represents a two-dimensional structure with rows and columns, similar to a spreadsheet or a SQL table. It is important to note that in a Pandas DataFrame, the columns can be of different types. PyArrow, the Python binding for Apache Arrow, implements a column-based memory format for flat and hierarchical data that facilitates communication between components and handles format conversion.
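To illustrate the point about column types, a small DataFrame can hold string, integer, and float columns side by side (the data here is made up):

```python
import pandas as pd

# A two-dimensional structure with rows and named columns,
# where each column can have its own type.
df = pd.DataFrame({
    "name": ["Ana", "Bo"],     # object (string) column
    "age": [34, 29],           # int64 column
    "score": [88.5, 91.0],     # float64 column
})

print(df.dtypes)  # one dtype per column
```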

Finally, AWS Data Wrangler allows you to perform data extraction and load operations, including data transformations, using AWS data services like Amazon S3, Athena, Redshift, DynamoDB, and others. The AWS Data Wrangler library connects Pandas DataFrames to AWS data services. Its functions handle the data extraction and data loading steps in Python. This can simplify your data pipelines because you can focus your resources on the transformation part of the extract, transform, and load procedure.
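As a sketch of what such a pipeline might look like, the function below extracts with Athena, transforms with ordinary Pandas code, and loads Parquet back to S3. The database, bucket, table, and column names are placeholders, and the `awswrangler` import sits inside the function, so running the sketch requires neither the package nor AWS credentials until the function is actually called:

```python
import pandas as pd

def run_pipeline(database: str, bucket: str) -> None:
    """Illustrative ETL pipeline sketch: extract via Athena, transform with
    Pandas, load to S3 as Parquet. All names are placeholders."""
    # Requires the awswrangler package plus configured AWS credentials.
    import awswrangler as wr

    # Extract: run a SQL query through Athena into a Pandas DataFrame.
    df = wr.athena.read_sql_query("SELECT * FROM sales", database=database)

    # Transform: ordinary Pandas code -- the part you focus on.
    df["amount_usd"] = df["amount"] * df["fx_rate"]

    # Load: write the result to S3 as a partitioned Parquet dataset.
    wr.s3.to_parquet(
        df=df,
        path=f"s3://{bucket}/curated/sales/",
        dataset=True,
        partition_cols=["region"],
    )
```

The extract and load steps are each a single library call; everything in between is the transformation logic that remains under your control.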


About the Author
Jorge Negrón
AWS Content Architect
Learning Paths

Experienced in the architecture and delivery of cloud-based solutions, the development and delivery of technical training, and defining requirements and use cases and validating architectures for results. Excellent leadership, communication, and presentation skills with attention to detail. Hands-on administration and development experience, with the ability to mentor and train on current and emerging technologies (Cloud, ML, IoT, Microservices, Big Data & Analytics).