Wrestling with Data
In this course, we're going to do a deep dive into the various tools and techniques available for manipulating information and data sources along with showing you at the end of it how you can actually solve some real-world problems.
If you are trying to handle increasingly complex data sets and round out your experience as a professional data engineer, this is a great course to get a practical field-based understanding.
- Learn to determine when it's appropriate to use a programmatic approach versus pure SQL.
- How to access and manipulate your files and data sources using programming techniques available to you in languages such as Python.
- Familiarity with relational databases and other data formats such as CSVs and JSON.
- Baseline understanding of SQL
If you don't have all of these this course will still benefit you, but you might not be able to follow all of the examples.
Let's quickly dive into an example of how an ORM can simplify data manipulation. And this example, we're gonna discuss a university, which is keeping track of which clubs and extracurricular activities its students are participating in. We can represent this relationship in a simple logical model shown on screen. If you've been attending the data engineering learning path, you might recognize this type of representation as a logical representation that simply shows how relationships connect different entities.
Now, I know there's a lot of code on screen and it covers two languages of both Python and SQL. Don't worry about it, if you don't want to follow all the code, or it's hard to read on your screen, I just wanted to include it if you want it to pause and see how this example got created.
Basically, we're creating three tables to store all of the information of student, club, and then a table that allows us to build a one to many relationship between the two of them. On the left, you could see us starting to create the Django ORM that allows us to connect to it and begin to ask simple queries. Again, this is just here for reference so people can see how it gets created.
You can pause and copy it for your own use, or frankly, you could just keep watching and not worry about the specifics of the underlying infrastructure. To dig into how the ORM accesses the underlying pattern. Let's look at a few examples where we are going to use Django and Python in order to ask the database questions.
Firstly, if we want to view all the clubs that a particular student belongs to, we can simply do students dot objects dot get dot clubs. And what we're going to do is pass that the student ID number in the get clause. Now you may notice this is a very chained together way of accessing information, but importantly, we didn't need to run a join in order to access the information between students and clubs. This gets exponentially more valuable.
The more complex your data source gets, ORMs are also really cool because the same way you're able to get the clubs that a student belongs to. You're able to run it in reverse and you're able to ask it what students belong to a club, literally using the same format and requesting a different output row.
But remember, ORMs, aren't just good for accessing data. You're also able to change the names of objects, such as club names, and here you can see we're retrieving a club we're then setting the club dot name equal to a new name and saving the results, aka, committing it to the database. This is really useful and cool. Particularly as the complex transformations get more and more complicated. And maybe you have to clean data such as the aforementioned user entered dates that we discussed at the beginning of this class. And finally, just to show kind of a few transformations and new data creations at once. We're creating a new club named photography. We're creating a new student and then adding them to that club.
So this shows you that you can create data and manipulate data and then create relationships all through the ORM without having to worry about the underlying SQL. So taking everything we've learned from ORMs to access patterns, to start formats, to the advantages of SQL versus programming languages, let's run through a real-world example.
Imagine that you're a professional data engineer and your job is to support a sales team that is attempting to acquire new clients for your company. All of this data is in a CSV comma, separated values file within Amazon S3 storage buckets. This data contains information such as daily sales and transaction data. They'll need to load into a database for management to take a look at through an analytics and reporting tool.
How would you begin to design this data pipeline? What high-level components would you want to make this work? Feel free to pause it right now. If you want to sketch out your own little architecture or keep watching if you want to see the answer.
So firstly, on the left, of course, we have the CSV within the Amazon S3 bucket. This information is loaded in by an external tool. And as such, that process is not something we really need to worry about. What we do need to worry about is how are we going to get the information out of S3 and into a database. So for that, we can turn to Python and Django.
So basically we're able to reach out to the CSV and S3, pull it in, manipulate it, and then store it via an ORM into a SQL database. Now, remember an ORM is simply how we interact with the database. You might need an additional helper libraries such as pandas. If there is heavy-duty manipulation required.
Now it's a little unclear in this example, what reporting framework will be supportive. There are many of them, many reporting tools and VI tools that connect directly to the database. So once we store the data in a good format, it's not our concern. However, if it's a custom home built solution, the ORM the same one that we use to load data to the database can provide a phenomenal interface for in house developers to create custom reports. In addition to whatever tool they're using supports by default.
So in summary, we're using Python with an ORM and potentially a helper library, such as pandas to pull data from S3 and CSV format, and then automatically translate and store it in a SQL database. And from the SQL database, either a tool can connect directly or take advantage of the same ORM we use to load the data, to access the data for downstream reporting.
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads development cycles are cut down to a fraction of their normal time. The ability to quickly develop large scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.