Wrestling with Data
In this course, we're going to do a deep dive into the various tools and techniques available for manipulating information and data sources, and at the end we'll show you how to solve some real-world problems.
If you are trying to handle increasingly complex data sets and round out your experience as a professional data engineer, this is a great course for gaining a practical, field-based understanding.
Learning objectives:
- Learn to determine when it's appropriate to use a programmatic approach versus pure SQL.
- Learn how to access and manipulate your files and data sources using programming techniques available in languages such as Python.
Prerequisites:
- Familiarity with relational databases and other data formats such as CSV and JSON.
- A baseline understanding of SQL.
If you don't have all of these, this course will still benefit you, but you might not be able to follow all of the examples.
Hey everyone. Welcome to Wrestling with Data, a class in which we take a practical approach to data manipulation. In this course, we're going to do a deep dive into the various tools and techniques available for manipulating information and data sources, and at the end we'll show you how you can actually solve some real-world problems.
Specifically, the goal of this course is to learn to determine when it's appropriate to use a programmatic approach versus pure SQL, and how to access and manipulate your files and data sources using advanced programming techniques available to you in languages such as Python. This course will focus on Python as it's one of the more common languages, but all of the lessons here are applicable to any object-oriented language.
Now, there are some prerequisites for this class. You are going to need familiarity with relational databases and other data formats such as CSVs and JSON files. You'll also need a baseline understanding of SQL. And when I say baseline, you need to be familiar with SELECT, INSERT, and UPDATE statements. You don't really need to know more advanced commands such as upserts and how they affect things. And finally, you need some experience with an object-oriented programming language, such as Python or Java.
Now, if you don't have all of these, this class will still benefit you, but you might not be able to follow all of the examples. And finally, if you're still on the fence about whether you should attend: if you are trying to handle increasingly complex data sets and round out your experience as a professional data engineer, this is a great course for gaining a practical, field-based understanding.
So to begin our discussion, it's important to understand how data has changed over the last few years. Going back 15 or 20 years, especially if you were at a big company such as a Fortune 500, and definitely if you were in a Fortune 100, you would see that data was in a database, typically on a mainframe, most likely from IBM. This was a very controlled, very centralized approach to information, limited by the technology of the time. But what you're going to see is that most legacy data approaches use SQL, or some form of structured query language.
Over time, however, as technology evolved, as more people got into the space, and as competitors started to democratize access to data, data began to arrive in many different shapes and formats. When we talk about data today, we're no longer just talking about SQL databases and relational databases. I'm sure everyone here has seen a CSV, an XML file, or something like that. Now data could come in an S3 bucket. It could come in a NoSQL database. It could come as raw files on an FTP server or on your local disk. There's a much more open approach to data as you tackle newer problems.
Now, just so no one watching this thinks that SQL is bad or outdated, it absolutely isn't. It's just that in the past, when it came time to interact with data (building our data model, selecting the file format, or manipulating and transforming the data), SQL might have been your only real choice.
Now, Java is also a choice that came onto the scene a lot earlier, but there are better and more powerful ways to accomplish this today, particularly if you're with a smaller team or smaller company that doesn't have the legacy overhead from previous projects. In terms of pros and cons, let's quickly go through what SQL is good and bad at, both as a data query language and a data access language. First, SQL is extremely interactive. That means you get answers immediately. You're in a system, you're logged in, you submit a query, and you'll typically see results almost immediately. This allows you to quickly see, interact, and play with your information.
SQL is also extremely simple and quick. You're able to just submit queries and get results without the need for a production deployment. You don't need to import libraries. You don't need to worry about system compatibilities. If your SQL client supports the format you're submitting a query in, you can get results immediately. And finally, SQL is really, really good at combining multiple tables and multiple data sets.
So if your manipulation job requires you to transform information that might come from a variety of different source systems, SQL is really good at that. To put it in perspective, you might have a customer table and a product table, and you might need to join these two to see which customers have bought which products. By selecting simple data elements, or fields as they're called in SQL, from each one, you're able to quickly join them and look at them as a unified data structure. However, for all of its strengths, SQL starts to fall short when it comes time to manipulate data. It's great for access. It's great for querying. It's great for simple insertions. But if you have more complex transformations, particularly repetitive tasks in which a thousand different records each need a unique transformation, SQL starts to get very cumbersome.
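To make the customer/product join concrete, here's a minimal sketch using Python's built-in sqlite3 module. The table and column names are hypothetical, chosen just for illustration:

```python
import sqlite3

# Build an in-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE purchases (customer_id INTEGER, product TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO purchases VALUES (1, 'Laptop'), (1, 'Mouse'), (2, 'Keyboard');
""")

# A simple JOIN on the shared field unifies the two tables
# into one result set.
rows = conn.execute("""
    SELECT c.name, p.product
    FROM customers c
    JOIN purchases p ON c.customer_id = p.customer_id
    ORDER BY c.name, p.product
""").fetchall()

for name, product in rows:
    print(name, product)
```

Notice how little SQL it takes to combine the two tables; this is exactly the kind of task where SQL shines.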
Basically, SQL is great if you're manipulating structured data (structured in, structured out), but if you need to do data cleanup and data manipulation on a row-by-row basis, it's not really designed for that, and you're going to end up with unwieldy, colossal statements that are hard to read and maintain.
Now, a classic example of where SQL starts to fail in terms of data transformations is on things such as user-entered dates, where we need to look at what the date is trying to tell us and transform it into an actual timestamp or datetime data format. Many databases have a system for handling this, but they're not flexible. If you have unknown free text coming in, particularly if it needs to conform to a rigid standard, you're better off having a preprocessing language such as Python or Java handle it before you attempt to wrangle it with SQL. And finally, queries with lots of joins can become very complex over time.
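A preprocessing pass for messy user-entered dates might look like the sketch below, using Python's standard datetime module. The list of candidate formats is an assumption; in practice you would extend it as new variants appear in your data:

```python
from datetime import datetime

# Formats we expect users to have typed (an assumption for this sketch;
# extend this list as you discover new variants in your data).
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y"]

def parse_user_date(text):
    """Try each known format in order; return a datetime,
    or None if nothing matches."""
    cleaned = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None  # flag for manual review rather than guessing

print(parse_user_date("03/14/2021"))   # matches %m/%d/%Y
print(parse_user_date("14 Mar 2021"))  # matches %d %b %Y
print(parse_user_date("next Tuesday")) # no rigid format matches: None
```

Doing this in a loop, with explicit fallbacks, is exactly the kind of row-by-row logic that becomes painful to express in pure SQL.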
Although SQL is really good at initially joining data together and displaying it, if your data is formatted in a highly normalized fashion, with lots of different information that isn't normally joined together, these queries can become very complex and hard to debug. So if the data has a lot of nuance to it, with connections that are hard to mentally follow, SQL can become unwieldy over time.
In contrast to SQL are programming languages that are exceptionally good at data manipulation. Some of the most common of these are Python and Java, along with Scala, although you'll see others, such as C#, being used to manipulate information depending on the developer's stack and preferences.
In my personal opinion, Python is by far the leader, and the best option if you're looking to learn a new language for this. So in contrast to SQL, in which the data access layer is very much tied to the storage format, programming languages have a lot more flexibility, which is both their greatest strength and their greatest weakness.
With languages, you're able to completely customize how you interact with information and data. You're able to write things such as loops and conditional statements, and you're also able to manipulate strings and reformat data in ways that SQL, tied as it is to its underlying storage format, simply doesn't allow. And by extension of this flexibility and decoupling from the storage format, you're able to combine multiple data sources using a programming language. This means you're able to pull in data from a CSV, a SQL database, a NoSQL database, and some JSON files you found on the internet, and combine them into a composite data source.
Now, this flexibility is great, but it's also dangerous, because you can quickly start to lose control of your data and end up without good type enforcement or good data consistency. And finally, with many programming languages, in particular Python and Java, there are vast numbers of helper libraries and frameworks available to you.
A lot of problems, or at least parts of them, have been solved before, so you often don't need to write everything from scratch. You can often get an expertly made, pre-built library to really help you with your challenge. However, for all of their strengths, programming languages have some pretty significant weaknesses.
Very importantly, they're more difficult to set up. You often have to install libraries, set up a package manager, and select an IDE. Even if you do some DevOps magic with containers and helper functions, you're still going to need more infrastructure than a simple SQL editor. Furthermore, becoming proficient can take quite a bit of time.
Cloud Academy has plenty of classes teaching you how to use a programming language, but compared to SQL, reaching a similar level of proficiency will be more time-intensive. In my personal opinion, though, it is extremely worth it to become at least beginner-to-intermediate proficient with a language, either Python or Java, if you're looking to become a data engineer. So although it takes more time, it is worth it.
Calculated Systems was founded by experts in Hadoop, Google Cloud, and AWS. Calculated Systems enables code-free capture, mapping, and transformation of data in the cloud based on Apache NiFi, an open-source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education in these complex technologies.