The course is part of this learning path
In this course, we're going to do a deep dive into the various tools and techniques available for manipulating information and data sources along with showing you at the end of it how you can actually solve some real-world problems.
If you are trying to handle increasingly complex data sets and round out your experience as a professional data engineer, this is a great course to get a practical field-based understanding.
Learning Objectives
- Learn to determine when it's appropriate to use a programmatic approach versus pure SQL.
- How to access and manipulate your files and data sources using programming techniques available to you in languages such as Python.
Prerequisites
- Familiarity with relational databases and other data formats such as CSVs and JSON.
- Baseline understanding of SQL
If you don't have all of these this course will still benefit you, but you might not be able to follow all of the examples.
So to step away from theory and to start to live up to this class's description of a practical guide, let's actually show some practical examples. Here we have a table which is just representative of a SQL database for something let's call movies. This is a table of all movies. There are four columns, one for the movie title, one for the year it was made, one for the director and one for its rating.
Now, this is just a five row extra of a table that might be thousands or in some cases, millions of rows. So keep in mind that all of the queries we're showing will scale up to handle the bigger databases as well. And we're just keeping the size small for this thought exercise. Let's compare two different ways of accessing this database.
On one hand, we have SQL and on the other hand, we have Python and within Python we'll be using pandas, which is a great helper library for manipulating data. To kick off our comparison, let's look at what it would take to retrieve all movies.
SQL is a very straightforward select from format. And as a refresher star means all columns or fields and movies refers to the table. On the right, you could see Python's approach and at face value that seems extremely simple. Now, remember when using programming languages, some setup is often required, and not shown here is configuring the database connection, the table and the permissions. But once you get that set up done, you have an extremely easy way to reference the table as a simple variable named movies.
To begin to manipulate this data, let's retrieve the first five movies from the database. With SQL, we will append a limit statement of limit five to this in order to limit our results to five. Now, it's important to note for those of you learning SQL still, this doesn't necessarily guarantee the order of movies because we haven't set a statement that will sort the movies, but we're simply limiting our results to five of them. And pandas, it's actually a little more nuanced in this case we're also not guaranteeing us the same five movies, but by simply putting .headfive in there, we're able to limit it to the first five results as well. And finally, to ask it a bit more of a real world problem, let's try to say, what are the names of all the movies released in 1990.
In SQL, we simply do a select from statement with a where clause upended and you're simply comparing year to the number 1990. And within the select statement, we've replaced the star with the name field. To explain how to do this with pandas, first, we should start to use proper terminology.
The movies objects that I've been calling a variable is more formally called a data frame. There are more courses on pandas and the cloud academy course library, if you're interested in details, but the short summary is this data frame can be filtered with Boolean variables.
So, what we have here is movies.year equals 1990. That'll generate an array of zeros and ones or trues and falses that then allows us to filter the movie's data frame. So, what we're doing is a Boolean variable array within the data frame. And then we're selecting .movies to only extract that field.
In simpler queries such as this, if this is the extent of the data manipulation and access that you require, I would recommend sticking with SQL. However, as you'll soon see in later queries, Python, Java, and other programming languages start to really shine and help you access and manipulate data in a more sustainable and readable fashion.
Let's actually go through another few examples to help tease out where programming languages might excel over SQL. Now, let's say we've done something preliminary analysis and talking to some of the stakeholders and people actually using this data. And we've determined that we need to simply do some analysis of movies by year.
So, what we do is we make a new column called movie count and store a two column table of simple a year and movie count, which we will call movies by year. Now, imagine you wanted to ask the question what's years had the most movies.
Now, in SQL you would need to have three clauses on top of your select statement. You would need an order by movie count, which tells that, that we are going to sort the table by the column movie count. You would need desk or descending to say we're going to start at the highest number and go down and finally limit five, which says limit the return set to simply five rows. And when we compare this to pandas, it starts to become clear why data languages once proficient are easier, pandas and Python by comparison only need one extra clause. The function and largest takes the variables or parameters of five, and then the column name and that alone will allow you to accomplish what SQL require three causes for in one well-documented function. And finally, to add one last bit of complexity, just to really showcase how Python remains easier to read.
Once you overcome the initial readability hurdle, let's try to ask it, what are the five most active or most popular years to make a movie excluding the first five? So in SQL, after the limit statement, we would have to add the expression offset five. Now, this gets a little confusing in that we are now adding clauses after the limit, which is traditionally the last statement of SQL and then Python, all we have to do is do .tailfive. Once again, a well-documented function that allows you to read it. Whereas SQL, we have to kind of read the whole statement, and then interpret it. And Python, we're able to read it across the line, looking at the function being called, understand it, and have it be more readable to a larger team over time.
Let's shift gears and talk about some of the more advanced approaches we can take within a programming language to really accelerate and amplify some of its advantages. And ORM or object relational mapping is a advanced technique used for converting data between a SQL database or an underlying data structure and your object relational language.
In other words, it allows you to create a virtual representation of the underlying data system and put it in a language that's more accessible to Python and object oriented functions. Now, if that sounds a little complex or confusing, don't fret, we're gonna dive into it a bit more, but at a high level, know that an ORM is great at allowing you to do things such as record insertion, selections, transformations, and deletions from an underlying database without needing to write SQL.
Things like pandas might benefit from an ORM. But whereas pandas is a way of manipulating and querying data, this provides the access layer between the data storage format and such manipulations. To dive into the advantages of this and why would wanna use an ORM, let's look at what it enables.
So, by decoupling the underlying database and the object oriented programming within the code, we're able to separate the teams and how each one's being maintained. So, if one team needs to make changes to a SQL database, changing the ORM around just a little bit is way more sustainable and easy to implement than having to change the entire code base. So, it allows for easier development through the separation of different layers of your application.
Secondly, ORMs also are database agnostic. This means you can change what your underlying database is without having to rewrite all of your code especially if features change. A very common pattern for this is if you start in a smaller database, such as my SQL and have to upgrade to a more distributed database down the line, using an ORM could help mitigate any migration effort and pains. And also it helps you protect against things such as SQL injections, whereas one database format might be more vulnerable to another one.
To speak from personal experience, an ORM really helped out when we switched from RDS or Amazon's Postgres offering to Amazon's Aurora where we didn't need to change any real underlying bits of the code past changing where the ORM was pointing. On the performance front, ORMs can really help as well. Many of them have the ability to cache results and help with threatening to remove load from the database.
Basically, if you have lots of common lookups, you're able to configure caching so that the ORM itself will store the language in the application memory, rather than hitting the database. This can be a bit of a double edged sword in that maybe you miss an update.
So please be aware of how you cache and your caching strategy, but the tool and its availability is exceptionally helpful in high performance applications. And finally, one of the last major advantages of an ORM is the ability to predefine relationships.
If you're regularly working with the same data set and find yourself making the same joins time and time again, ORMs allow you to eliminate the need to remember those and potentially mess them up each time by having predefined views and access point to the underlying data built into them. However, like everything, there are always downsides to selecting a new technology, in the case of an ORM, it's getting another library that needs to be learned, adopted, and supported within your organization, company or school, these new libraries, even when helpful do add to the overhead of needing to understand how they work.
Additionally, all of the ease of use features such as predefined relationships and caching add to the overhead. There is more processing power needed to accomplish the same workload in some cases. And although compute power is cheap, this is something that needs to be remembered.
In my opinion, though, the biggest downside of an ORM is that an enforced SQL ignorance. Basically, by allowing people to have different access patterns to the underlying database, than the database was designed for, you're potentially allowing people to make suboptimal queries that don't really perform as well as a straight SQL statement.
Now, this is a bit of an abstract concern that's debatable, but over time, if your users aren't understanding how the database itself is built and are just understanding the ORM, it could lead to problems if you don't try to keep the databases designed access pattern in line with what the ORM is supporting. And finally, the last major downside to an ORM is that it can have degraded performance versus raw SQL sometimes.
Particularly book updates and inserts oftentimes are better in raw SQL than running it through an ORM's library. But this is a bit of an edge case, and isn't as big of a concern as some of the other ones listed above.
Now, the ORM you select will oftentimes be governed by the language you're using, of course. Some of the most popular ones are SQLAlchemy and Hibernate. Although as you can see on screen, there's lots of strong options, even in things such as JavaScript, I recommend testing it and seeing which one's best for you. However, a little tidbit I'll say is be careful of things like Django, which have an integrated web framework versus SQLAlchemy which is just a straight ORM. So, pay attention to what your ORM is packaged with when making your selection.
Lecture
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads development cycles are cut down to a fraction of their normal time. The ability to quickly develop large scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.