Working With Data Sets
This course focuses on understanding common data formats and interfaces. It explores some common data formats that you'll encounter as a data engineer. The goal is to develop a deep understanding of the pros and cons of storing your data in different ways. We then focus on translating that high-level concept into a more concrete understanding, showing how the same dataset can be accessed and viewed differently simply by storing it in a different fashion.
If you have any feedback relating to this course, please contact us at firstname.lastname@example.org.
- Learn about different data sources and formats, and how to model your data
- Get acquainted with the common data formats (CSV, XML, and JSON) as well as specialized data formats
- Learn about databases and how to exchange data between applications
This course is suited to anyone looking to gain a practical, hands-on understanding of data modeling and for those who might want to change how they're storing their data.
To get the most out of this course, you should be familiar with what CSV and JSON are, and with databases at a high level.
So the next data format you should really be familiar with is XML, or Extensible Markup Language. You may already be familiar with this. It's especially common in bigger enterprises, where it's used in many business applications. At the most basic level, it's a hierarchical tree structure: a logical document made up of elements.
Now, these elements may contain information and, like we said, attributes. They also have the concept of metadata within them. So this is a more structured way of representing your data. It is widely used for things such as EDI or SOAP. And in my opinion, and this is not something that is formal or anyone's position other than my own, XML is not as widely used anymore. It is still extremely useful and extremely valuable. In my experience, I'm seeing more of a shift towards JSON and other data formats, but XML shouldn't be undersold here. It is extremely versatile, extremely structured, and extremely widely accepted.
So XML is a little more confusing than CSV, so let's break this down and go through it step by step. This is the sales data that we previously showed as a CSV, and maybe it's what you would like to try to express your own data model in. You'll notice that at the root level, the element is called the sales element. And then within that, we have nested sale (singular) elements. Within each sale element, we have a customer element.
So what you can see here is that we start to express the data in a hierarchical structure. That's very straightforward. We can clearly see the relationships between entities: there is a sale that involves a customer and a product. There's another big advantage of XML, too. If you look here, you can see that not all of the data is stored the same way.
So for the customer, you can see that the ID and the name sit inside the angle brackets of the tag, as attributes. However, then you have quantity purchased, where the value is stored between the opening and closing tags. This is a great way to separate information about an entity from information about the relationship. So you can see a customer has the attributes of ID and name, while the product has the attributes of ID, name, and price. And then you have the sale information of quantity purchased.
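To make the attribute-versus-element distinction concrete, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`. The slide itself is not reproduced in the transcript, so the document below is a hypothetical reconstruction: the tag name `quantity_purchased`, the helper `read_sales`, and all values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the sales document described above.
# Entity details (customer, product) are attributes inside the tag;
# the sale-level value (quantity purchased) is text between the tags.
SALES_XML = """
<sales>
  <sale>
    <customer id="1" name="Alice"/>
    <product id="10" name="Widget" price="2.50"/>
    <quantity_purchased>4</quantity_purchased>
  </sale>
  <sale>
    <customer id="2" name="Bob"/>
    <product id="11" name="Gadget" price="10.00"/>
    <quantity_purchased>1</quantity_purchased>
  </sale>
</sales>
"""


def read_sales(xml_text):
    """Flatten each <sale> element into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for sale in root.findall("sale"):
        customer = sale.find("customer")
        product = sale.find("product")
        # Attributes are read with .get(); element text with .findtext().
        quantity = int(sale.findtext("quantity_purchased"))
        price = float(product.get("price"))
        rows.append({
            "customer_id": customer.get("id"),
            "customer_name": customer.get("name"),
            "product_name": product.get("name"),
            "quantity": quantity,
            "total": price * quantity,
        })
    return rows


if __name__ == "__main__":
    for row in read_sales(SALES_XML):
        print(row)
```

Note that every value comes back as a string and has to be converted explicitly, which is why the sketch calls `int` and `float` on the quantity and price.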
So XML is a little more complex than CSV, but it really pays off by allowing you to store information in a somewhat more structured way without having to fully commit to a highly structured data format. You may want to pause here for a minute and try to express your data model in XML; those of you who have more complex data models might find this especially useful. Now, to go through the pros and cons of XML: it's great because it's self-describing and easily readable by machines and humans.
Basically, unlike a CSV, where if you lose your header you're kind of lost, XML is self-contained at every step. However, this has a cost. The syntax can get a little verbose and redundant, and files and data sets can get very large. The good news is they can be broken up, but parsing can get a little slow because you're processing the same information again and again.
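The verbosity trade-off can be sketched in Python as follows. The same rows are written once as CSV and once as XML, and the XML is then read back one sale at a time with `iterparse`, which is one way to avoid holding a large file in memory. The rows, field names, and helper functions are hypothetical and chosen only for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Illustrative rows only; not the course's actual data set.
ROWS = [
    {"customer": "Alice", "product": "Widget", "quantity": "4"},
    {"customer": "Bob", "product": "Gadget", "quantity": "1"},
]


def to_csv(rows):
    """Field names appear once, in the header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_xml(rows):
    """Field names are repeated as tags around every single value."""
    root = ET.Element("sales")
    for row in rows:
        sale = ET.SubElement(root, "sale")
        for field, value in row.items():
            ET.SubElement(sale, field).text = value
    return ET.tostring(root, encoding="unicode")


def stream_quantities(xml_text):
    """Yield one quantity per sale, freeing each sale after it is read."""
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "sale":
            yield int(elem.findtext("quantity"))
            elem.clear()  # drop the children we no longer need


if __name__ == "__main__":
    csv_text, xml_text = to_csv(ROWS), to_xml(ROWS)
    print(len(csv_text), "characters as CSV")
    print(len(xml_text), "characters as XML")
    print(list(stream_quantities(xml_text)))
```

The XML output is longer because each field name is written twice per value, as an opening and a closing tag. That repetition is also what lets any single sale be understood without a header.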
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.