The course is part of this learning path
This course explores data sources and formatting, and how to present data in a way that provides meaningful information. You'll look at data access patterns, and how different interfaces allow you to access the underlying information. This course also provides a practical, real-world example of how all this theory plays out in a business scenario. By the end of this course, you will have a good foundational understanding of how to wrangle and visualize data.
If you have any feedback relating to this course, feel free to reach out to us at firstname.lastname@example.org.
- Understand the difference between data and information
- Learn how to make data useful in order to gain insights from it
- Learn how to store data correctly
- Understand how these techniques can be applied in the business world
This course is ideal for anyone who is required to interpret or understand data for reporting purposes or for use in machine learning initiatives.
To get the most out of this course, you should be familiar with relational databases such as SQL or NoSQL and some common data formats such as CSV and JSON.
Let's highlight the data processing efforts required to make data into information with a case study. This is a real world example, something I've personally have had to wrangle with throughout my career, and that's sales reporting and how people have trouble interpreting that data and reading it as information.
Typically a sales report is gonna be an Excel sheet or a CSV sheet and management and nontechnical people hate having to deal with raw numbers and prefer to see an interpreted report presenting information. Typically with this type of problem, there's a lot of problems within this type of problem. But within this challenge is that there's no easy or normal way to summarize the data and group it.
Data typically needs to be grouped in different ways than is presented in a raw sales spreadsheet, and that creates a problem. So we have to think about how do we keep that out of flexible enough to change its grouping, but also speed it up so that it's not as slow as having to read through every single file and parcel each time. And finally, with data such as sales, which changes drastically date by day or month by month, it requires a baseline level of constant manipulation in order to get accurate, usable information on a daily basis, or whenever you look to view the report.
So how can we help the situation? What do we learn from our ability to process and prep data that will help us take this unwieldily amalgamation of Excel sheets and turn them into a controllable, reliable report?
When approaching a problem like this, the first thing you need to do is determine what data you have available, and what data is important. In this case, we have several pieces of data of which one is the date, and then the others are split between describing the customer and describing the product. So we really have three domains of data here, being a date, a product, and a customer.
Additionally, as we saw in the processing step previously, insights are really a function of your storage and access pattern and the visualizations and trend analysis you could put on top of them. So not only do we need to consider what data is available, we need to consider how we store it and how we can interact with it.
We actually have a whole course on storage formats in the data engineering learning path. If you haven't watched it yet, where we dive into subjects such as picking the right database and defining access patterns and logical models. So when picking a storage format, remember that this data changes frequently and it's regularly updated and files have proven to be a slow approach.
So picking a database is most likely the correct approach here, simply because it is an easy way to access the data, especially when being refreshed and changed regularly. Secondly, when considering normalized versus de-normalized, as we previously mentioned, views provide a very strong way to do this. Although it's a little beyond the scope of this class, it's important to think is the data going to be refreshed in real time or more on a daily or scheduled basis?
A materialized view allows you to condense all of the data from all the tables into one de-normalized format, but unlike a regular view, which life course is the underlying tables, a de-normalized view sets up the data for batch processing. So in this case where the data is updated nightly with the daily sales figures a materialized view makes a lot of sense because it allows us through a batch refresh and then quick access of the underlying data. To put actual technology to this, a MySQL or Postgres database with a normalized table structure, supporting a materialized view that de-normalizes the data into a readable information structure is optimal.
Next, we'll wanna consider the actual mechanism required to view the data. We know that management wants something more dynamic than spreadsheets and that the data itself is updated nightly. So in something like this, a sales dashboard is key, products such as Amazon's quick site provide a quick way to connect directly to a database, while on other data engineering classes, we've discussed what it means to make an object relational model, available through a programming language for custom API interface.
If you're at a bigger company, you might already have tools such as Microsoft Power BI or Tableau available, whereas other more developer and programmer centric tools such as SageMaker or IPython Notebooks or Zeppelin notebooks provide great ways to quickly visualize and display underlying data and information. All of this leads up to helping management derive insights from the data.
When we think about it, some interesting questions start to appear, and we need to think about how to support these visualizations. One of the most common questions will be asked in this field is geolocation, where are your top sales coming from? From this aspect, think back to their data where you have things such as IP address, and maybe customer name of the company.
So what we're able to do is then use that data and combine it, aggregate it and enrich it, and then combine that aggregation with a clever display format of showing it on a world or country map, and then we're able to quickly and easily show where sales are coming from.
Other common questions for insights are things such as, at what time or day do our products sell best and which products sell best and worse, along with more complex things like profit margins. You might wanna pause and think about some of these questions, but the key in every case is raw data combined with enrichments and processing and human readability, along with all the other processing methods we discussed when combined with a visualization engine, such as QuickSight or Tableau, allow you to answer these questions quickly and efficiently to a large number of people.
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads development cycles are cut down to a fraction of their normal time. The ability to quickly develop large scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.