Working With Data Sets
This course focuses on understanding common data formats and interfaces. It explores some common data formats that you'll encounter as a data engineer. The goal is to develop a deep understanding of the pros and cons of storing your data in different ways. We then focus on translating that high-level concept into a more concrete understanding, showing how the same dataset can be accessed and viewed differently simply by storing it in a different fashion.
If you have any feedback relating to this course, please contact us at firstname.lastname@example.org.
- Learn about different data sources and formats, and how to model your data
- Get acquainted with the common data formats (CSV, XML, and JSON) as well as specialized data formats
- Learn about databases and how to exchange data between applications
This course is suited to anyone looking to gain a practical, hands-on understanding of data modeling and for those who might want to change how they're storing their data.
To get the most out of this course, you should be familiar with what CSV and JSON are, along with databases at a high level.
At the most basic level, JSON uses key-value pairs, in which each key maps to a value, an array, or in some cases a nested JSON object. It is a self-describing format that's a compact and versatile way to access information. On a personal note, I prefer this data format when a project starts to outgrow a CSV: when it gets bigger, or a little more complex, or when the data doesn't quite fit into a two-dimensional table the way a CSV does.
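As a rough illustration of those three kinds of values, here is a minimal Python sketch using the standard `json` module. The record and its field names (`id`, `tags`, `customer`) are made up for the example and are not taken from the course material.

```python
import json

# Hypothetical record; the field names are illustrative assumptions.
record_text = """
{
  "id": 1,
  "tags": ["online", "gift"],
  "customer": {"name": "Ada", "city": "London"}
}
"""

record = json.loads(record_text)  # parse JSON text into Python dicts and lists

plain_value = record["id"]                 # key -> single value
array_value = record["tags"]               # key -> array
nested_value = record["customer"]["city"]  # key -> nested object -> value

print(plain_value, array_value, nested_value)
```

Because every value travels with its key, the record describes itself. There is no separate header row to keep in sync, as there would be with a CSV.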
So here you can see how we have our sales data represented as a JSON file. As you can see, we have a main sales object, which contains a second object, which is an array of all the sales. We have information on each sale, such as ID and date, along with nested information on the customer and product. So you can see that, much like XML, it can store complex relationships beyond what a CSV can offer, but it's also more commonly used in the cloud.
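The slide itself isn't reproduced in this transcript, so the following is only a sketch of what such a sales document might look like. The key names (`sales`, `sale`, `customer`, `product`) and all of the values are assumptions chosen to match the description above, not the course's actual file.

```python
import json

# Hypothetical sales document: a main "sales" object holding an array of sales,
# each with an id, a date, and nested customer and product objects.
sales_document = {
    "sales": {
        "sale": [
            {
                "id": 1,
                "date": "2020-01-15",
                "customer": {"name": "Ada", "city": "London"},
                "product": {"name": "Keyboard", "price": 49.99},
            },
            {
                "id": 2,
                "date": "2020-01-16",
                "customer": {"name": "Grace", "city": "New York"},
                "product": {"name": "Monitor", "price": 199.0},
            },
        ]
    }
}


def summarize_sales(document):
    """Return one (id, date, customer name, product name) tuple per sale."""
    return [
        (sale["id"], sale["date"], sale["customer"]["name"], sale["product"]["name"])
        for sale in document["sales"]["sale"]
    ]


# Serialize to JSON text and parse it back, as if it had been read from a file.
json_text = json.dumps(sales_document, indent=2)
round_tripped = json.loads(json_text)

rows = summarize_sales(round_tripped)
print(rows)
```

Flattening the nested customer and product into tuples, as `summarize_sales` does, is roughly what you would have to do up front to fit the same data into CSV rows.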
Some databases, which we'll discuss later, such as MongoDB, can natively accept JSON documents deposited directly into them. So for those of you following along, perhaps pause here and write your data model out as JSON. The main pro is that it's a lightweight data format that's widely accepted in many modern applications, particularly on the web.
Now, it does support some schema control, but there's really no standardized schema or way of controlling it. It does support easy sharing, but it doesn't support namespaces and it doesn't really have space for metadata, so if your data is nestable and you need it to be widely shareable, this is great. If your data has a lot of complex relationships, maybe JSON isn't the right data format for you.
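Since JSON itself doesn't enforce a structure, applications often check records on their own. Here is a minimal hand-rolled sketch of that idea. The required key names and the `missing_keys` helper are hypothetical, and this is not a substitute for a full schema language.

```python
# Hypothetical set of keys every sale record is expected to carry.
REQUIRED_SALE_KEYS = {"id", "date", "customer", "product"}


def missing_keys(sale, required=REQUIRED_SALE_KEYS):
    """Return the required keys absent from a sale record, sorted by name."""
    return sorted(required - set(sale))


complete_sale = {
    "id": 1,
    "date": "2020-01-15",
    "customer": {"name": "Ada"},
    "product": {"name": "Keyboard"},
}
partial_sale = {"id": 2, "date": "2020-01-16"}

print(missing_keys(complete_sale))
print(missing_keys(partial_sale))
```

A check like this only looks at top-level keys. Validating nested objects or value types would need more code, which is part of the trade-off described above.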
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.