This course explores data sources and formatting, and how to present data in a way that provides meaningful information. You'll look at data access patterns, and how different interfaces allow you to access the underlying information. This course also provides a practical, real-world example of how all this theory plays out in a business scenario. By the end of this course, you will have a good foundational understanding of how to wrangle and visualize data.
If you have any feedback relating to this course, feel free to reach out to us at firstname.lastname@example.org.
- Understand the difference between data and information
- Learn how to make data useful in order to gain insights from it
- Learn how to store data correctly
- Understand how these techniques can be applied in the business world
This course is ideal for anyone who is required to interpret or understand data for reporting purposes or for use in machine learning initiatives.
To get the most out of this course, you should be familiar with databases, both relational (SQL) and NoSQL, and with some common data formats such as CSV and JSON.
With two of the four processing steps, organization and enrichment, being directly tied to the underlying storage, it almost goes without saying that how you store and record your data is critical. A key question when choosing a storage format for both data and information is: are you storing for efficiency or for usage? Understanding this difference is what starts to separate a new data engineer from a more experienced one. And basically, the cornerstone of this is that machines interpret data differently than humans.
What is scalable and fast for a machine might not be scalable and fast for a human. And at no point is this difference more strongly emphasized than when comparing normalized versus denormalized storage formats.
So to dive into this, let's look at two different ways of storing the same information: one in a normalized and one in a denormalized fashion. In both cases, we'll be storing the same information, which is employees, their roles, and what company they work for. To look at the normalized one first, you can see on the screen an organization table. This by itself just has the name of a company, an ID, and then a foreign key reference to type. A foreign key reference, as a quick refresher for those of you new to databases, just references the ID column in another table.
In this case, the type foreign key references the type value of the organization type table. So here you can see type values one, two, three, and four, and the type name. So we can see that Dunder Mifflin, being a type four company, is a business corporation. This foreign key reference stops us from having to record the much longer type name in every row, and simply replaces it with a single integer. But in turn, it requires us to do a lookup.
And then finally we have an employee table, which has an organization ID, labeled "Org ID", on the end of it, which in turn references the organization table. So, to get a full row of information, we need to do two joins. First, the organization table needs to be joined to the organization type table via the type ID. And secondly, the employee table needs to be joined to the organization table via the org ID.
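As a rough sketch of the three-table layout and the two joins just described, here is a small Python example using the standard library's sqlite3 module with an in-memory database. The table and column names (organization_type, organization, employee, type_id, org_id), the employee rows, and the titles are assumptions made up for illustration; only Dunder Mifflin being a type four business corporation comes from the example on screen.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: names are illustrative, not the course's actual tables.
conn.executescript("""
CREATE TABLE organization_type (
    type_id   INTEGER PRIMARY KEY,
    type_name TEXT NOT NULL
);
CREATE TABLE organization (
    org_id  INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    type_id INTEGER NOT NULL REFERENCES organization_type(type_id)
);
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    first_name  TEXT NOT NULL,
    last_name   TEXT NOT NULL,
    title       TEXT NOT NULL,
    org_id      INTEGER NOT NULL REFERENCES organization(org_id)
);

-- Type 4 is the only type value whose name is given in the lecture.
INSERT INTO organization_type VALUES (4, 'Business Corporation');
INSERT INTO organization VALUES (1, 'Dunder Mifflin', 4);

-- Placeholder employees, invented for this sketch.
INSERT INTO employee VALUES (1, 'Ann', 'Example', 'Sales Representative', 1);
INSERT INTO employee VALUES (2, 'Bob', 'Sample', 'Accountant', 1);
""")

# Two joins are needed to rebuild one full row of information:
# employee -> organization (via org_id), organization -> organization_type (via type_id).
rows = conn.execute("""
SELECT e.first_name, e.last_name, e.title, o.name, t.type_name
FROM employee AS e
JOIN organization AS o ON e.org_id = o.org_id
JOIN organization_type AS t ON o.type_id = t.type_id
ORDER BY e.employee_id
""").fetchall()

for row in rows:
    print(row)
```

Each employee row stores only a single integer for its organization, and each organization stores only a single integer for its type, which is where the storage efficiency comes from.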
This way is extremely efficient and machines are very quick at doing these types of joins, but humans aren't as well suited for this. So although this is good for a programmatic application access pattern, this is not how we would want to process and prepare data for human consumption.
Now consider if we wanted to make a denormalized view of this. On screen, you're going to see an extremely denormalized view that also has groupings. Simply start at the individual employee level, and you'll see that we have an employee ID, a name, and a title. This is a way of denormalizing the data from earlier, and above it you can see the company.
Now, these companies are groupings of employees, and in turn, it's extremely easy to read. The information is not quite as organized, so it's a little harder for a machine to go through, but to a human reader, it's much more straightforward and clear to see. And one final point when denormalizing data: don't be afraid to make new fields.
You might notice here that names are one field. This is in contrast to the normalized format in which first name and last name are separated. So don't be afraid to manipulate the data and transform it however you need. On a more technical note, if you're doing this in real life, consider making a view on top of normalized tables. That way you get some of the benefit of having a normalized database and some of the readability of a view on top of it.
The downside to this, of course, is that it's an extra step, but there are very large gains. Creating views is an industry-standard and very common way of achieving both the advantages of a normalized database and the readability of a denormalized view.
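To make the idea of a view on top of normalized tables concrete, here is a minimal sketch, again using Python's sqlite3 module. The view name employee_directory, the table and column names, and the employee rows are all hypothetical. The view does the two joins once and merges first and last name into a single name field, and the Python code then groups the rows by company for reading.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")

# Hypothetical normalized tables plus a denormalized view on top of them.
conn.executescript("""
CREATE TABLE organization_type (type_id INTEGER PRIMARY KEY, type_name TEXT);
CREATE TABLE organization (
    org_id INTEGER PRIMARY KEY, name TEXT,
    type_id INTEGER REFERENCES organization_type(type_id));
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,
    title TEXT, org_id INTEGER REFERENCES organization(org_id));

INSERT INTO organization_type VALUES (4, 'Business Corporation');
INSERT INTO organization VALUES (1, 'Dunder Mifflin', 4);
INSERT INTO employee VALUES (1, 'Ann', 'Example', 'Sales Representative', 1);
INSERT INTO employee VALUES (2, 'Bob', 'Sample', 'Accountant', 1);

-- The view hides the joins and creates a new combined "name" field.
CREATE VIEW employee_directory AS
SELECT o.name      AS company,
       t.type_name AS company_type,
       e.employee_id,
       e.first_name || ' ' || e.last_name AS name,
       e.title
FROM employee AS e
JOIN organization AS o ON e.org_id = o.org_id
JOIN organization_type AS t ON o.type_id = t.type_id;
""")

# Readers query the view like a single flat table; no joins required.
grouped = defaultdict(list)
for company, emp_id, name, title in conn.execute(
    "SELECT company, employee_id, name, title "
    "FROM employee_directory ORDER BY company, employee_id"
):
    grouped[company].append((emp_id, name, title))

# Print employees grouped under their company, similar to the on-screen layout.
for company, employees in grouped.items():
    print(company)
    for emp_id, name, title in employees:
        print(f"  {emp_id}  {name}  {title}")
```

The underlying tables stay normalized, so applications can keep using the efficient layout, while the view gives human readers the flattened version.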
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.