Working With Data Sets
The course is part of this learning path
This course focuses on understanding common data formats and interfaces. It explores some common data formats that you'll encounter as a data engineer. Basically, the goal is to develop a deep understanding of what the pros and cons of storing your data in different ways is. We're then going to focus on how to translate that high-level ethereal concept into a more concrete understanding and really showcase how the same dataset can be accessed and viewed differently if you were to just simply store it in a different fashion.
If you have any feedback relating to this course, please contact us at firstname.lastname@example.org.
- Learn about different data sources and formats, and how to model your data
- Get acquainted with the common data formats — CSV, XLM, and JSON — as well as specialized data formats
- Learn about databases and how to exchange data between applications
This course is suited to anyone looking to gain a practical, hands-on understanding of data modeling and for those who might want to change how they're storing their data.
To get the most out of this course, you should familiarize yourself with the concepts of what a CSV and a JSON is, along with databases at a high level.
So next up, let's dive into how you can actually store and access this information and go through some of the most common data formats. But if you want to follow along, I encourage you to pause here for a moment and come up with a simple data model, as it relates to your company or team. It could be something as simple as your code promotion process, if you're a deeply technical programmer or maybe it's exactly what we showed before if you're in a sales department or maybe it's customer support tickets and customers. Doesn't need real data, but pause here for a second and come up with a very basic data model, maybe with two entities and one relationship.
So one of the most basic and common data sets that people work with due to its high portability is the comma separated values or CSV format. Now you might know some of its siblings or cousins such as tab separated or pipe delineated, but the point is, is this is a semi structured data format and it's exceptionally well designed to handle flat data that can be represented as rows or columns.
Now not surprisingly, each comma in the CSV is separated by what's called the delimiter. In comma separated values it's commas, in pipe separated values it's pipes, and in tab separated values it's tabs. So one of the biggest weaknesses of this, however, is if your data contains that delimiter, the data can get hard to parse and get confusing. There are ways to handle it with what's called escape characters but if your data has a lot of commas or pipes or tabs, this might not be the best format.
However, CSVs it can't be understated are one of the most common import-export data formats for things like spreadsheets, databases, and other enterprise software. So if you need something that's quick to set up, highly portable and widely accepted a CSV is a strong choice.
So let's come back to our previous example. And if you're following along, feel free to do this exercise for your work too. Let's represent our data model in a CSV. Now here you can see we have a sales ID as the first column, and then some attributes about that relationship, such as sales date, sales total.
Then we have a customer number, which is a reference number for the customer entity and then a product number and the product name. So you can see that we have the customer number and the customer name, referencing the customer entity. We have the product number and the product name referencing the product entity. And then we have some attributes about the relationship, AKA the sale.
Now there's better ways to portray this type of data, particularly as we get more complex, but this is a perfectly valid way to start to store simplistic flat data. So for those of you following along, feel free to pause for a minute and build out a CSV for your data model. But to go through the pros and cons of CSV in a bit of a list fashion, basically they're simple and easy to work with and easy to parse, most programs understand them. They go straight into your Excel or other spreadsheet program of choice. But the downside is they don't really handle data duplication and sometimes delimiters get messed up.
Data types are not strongly enforced. So maybe a number and a string get mixed up. And also, this comes up especially in bigger data sets, column names are not always present. A single row doesn't describe itself if you don't have the first row. So CSVs, great to get started with, easy to use, but don't really support increasingly complex data sets. And they also don't offer any protections around duplication and type enforcement.
Introduction - Data Sources and Formats - Modeling Your Data - XML - JSON - Specialized Data Formats - Databases - Exchanging Data Across Applications - Applying What we Have Learnt - Sales Data for an Online
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads development cycles are cut down to a fraction of their normal time. The ability to quickly develop large scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.