Working With Data Sets
The course is part of this learning path
This course focuses on understanding common data formats and interfaces. It explores some common data formats that you'll encounter as a data engineer. Basically, the goal is to develop a deep understanding of what the pros and cons of storing your data in different ways is. We're then going to focus on how to translate that high-level ethereal concept into a more concrete understanding and really showcase how the same dataset can be accessed and viewed differently if you were to just simply store it in a different fashion.
If you have any feedback relating to this course, please contact us at firstname.lastname@example.org.
- Learn about different data sources and formats, and how to model your data
- Get acquainted with the common data formats — CSV, XLM, and JSON — as well as specialized data formats
- Learn about databases and how to exchange data between applications
This course is suited to anyone looking to gain a practical, hands-on understanding of data modeling and for those who might want to change how they're storing their data.
To get the most out of this course, you should familiarize yourself with the concepts of what a CSV and a JSON is, along with databases at a high level.
By now, some of you must be thinking, what the heck? Why haven't we dove into databases? I'm tired of hearing about files. And before we do that, I just want to make one more key point, and that is the difference between serializing and storing data. This is important when it comes to databases, more so in files. Serialization is basically, taking or translating data structures, so that it can be stored or transmitted, and reconstructed later.
Databases are a phenomenal place to store data. You could take lots of data, throw it in there and store it for an extremely long amount of time. However, it's important to note, whereas maybe you can email a CSV from one person to another, databases don't really support the same level of transmit ability. Some hybrid systems like Microsoft Access appear to. That's a debate for another class, but as a rule of thumb, databases are not good for passing the entire database between systems, what they are really good at is storing and retrieving data as you need it. And finally, before we discuss the specifics of a database, it's important to understand how the data model looks from a practical sense.
Here we've shown what a data model could look like for a small recording studio, where you have albums, genre, and artists, and on the right, you can see a Table Create statement to track their customers.
Now, for those of you following along with handwritten entities and testing out your different formats, feel free to try to put your data model into this format. And you're going to notice that, unlike when working with files, this really allows you to tailor your storage from a format to more closely resemble the ideal entity model that you've described. The complexity is, this all needs to be defined ahead of time in some types of databases before you can start storing data.
When it comes to databases, relational databases are by far the most common. If you're following along in the learning course about data engineering, we've briefly touched on databases in previous classes. This class I was gonna more focus on what it means for your data and your data model.
So relational databases, data is stored in tables, in rows and columns. Very similar to a CSV. However, an important difference is that these tables have programmatic enforceable ways of relating to each other. So perhaps you have one table that has record names and another table that has record genres, and you can make a programmatic formal controllable link between them, so your data model starts to become codified.
You might hear terms such as normalization and de-normalization. This is simply a spectrum of how many tables you have and how isolate each table is. For example, a table that describes me the instructor might have a table for names and a table for roles. So you would have Chris Gambino, instructor, two separate tables. A de-normalized schema would simply have three columns, first name, last name, role, no need for a join.
There are advantages and disadvantages to each, but at a high level, that is what you should be thinking of when you think of normalized versus de-normalized in reference to relational databases. Also very importantly, unlike files, well, most files, Apache, Avro, and a few of the more specialized formats aside, there's not really the concept of just viewing a database, you need to use SQL or Structured Query Language in order to access the data.
Now, this is covered in depth in the Cloud Academy archives, so please search there if you need a brief refresher on how SQL works. So some of the most common examples of a relational database are really what's in Amazon RDS, Google, Cloud SQL, or some of the Azure services.
Basically, you have MySQL, SQL Server, Postgres. And then Amazon has things like Aurora and Google has things like BigQuery. Basically though, all of these have different quirks and different advantages. They all accomplish very similar activities with slight differences between them, such as Postgres can really be oriented towards time series data, while MySQL, you know, in my personal opinion is one of the easiest, if not the easiest to set up.
The other major category of databases is NoSQL, which stands for not only SQL, unlike what I thought at first of not SQL. And these differ from relational, or what's called SQL compliant, said in air parentheses, databases, in that NoSQL databases, your data does not need to neatly line up in the tabular row, column structure. In fact, depending on the database you pick, there's a lot of different methods to store and retrieve the data.
So at a high level, just think NoSQL sacrifices some of the direct tight control that a SQL compliant database forces, while giving it back to you in terms of flexibility. Now, this flexibility is really extremely valuable when dealing with large amounts of variable data.
Personally, I've worked with this when it comes to IOT development facilities, where new sensors might be coming online. We were working with an automobile where the test engineers one day would have things like, the windshield position sensor plugged in, and the next day the fuel gauge would be plugged in, and then the next day, all of them would be missing and replaced with an RPM meter on one of the tires.
So the dynamic scheme of NoSQL allowed us to capture that, whereas going with a SQL compliant database would force us to constantly update the schema or consider what that means from a normalization perspective. Now, that's not to say flexibility is king, sometimes you want control. Sometimes we wanna know that there's something different, but if it's expected, NoSQL is how you can start to handle that.
With NoSQL databases in particular, it's important to kind of draw the distinction between what's a non-cloud specific one and one that a cloud provider is giving. All of the cloud providers seem to really like to give proprietary only on their platform, NoSQL databases. So in the case of kind of portable ones, you have your MongoDB, your Couchbase, your CouchDB. These are ones that you can kind of take wherever you want in on-prem, but if you're willing to tie yourself to a cloud provider, you'll have some really strong choices as well.
Amazon's DynamoDB, and Azure's Cosmos DB in particular deserve a shout out for offering phenomenal ease of use and scalability. A very important thing too, just to point out with these NoSQL databases, is how they interact with JSON. NoSQL databases in particular really have good APIs, well, most of them do, to store JSONs natively. So you can send it a JSON, and it'll just store it and then return it to you just like that.
So if you're thinking that you need to provide people regular exports and port sum of it, maybe a NoSQL database with a JSON like we're showing is a really strong choice.
Introduction - Data Sources and Formats - Modeling Your Data - CSV - XML - JSON - Specialized Data Formats - Exchanging Data Across Applications - Applying What we Have Learnt - Sales Data for an Online
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads development cycles are cut down to a fraction of their normal time. The ability to quickly develop large scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.