This course focuses on understanding common data formats and interfaces. It explores some common data formats that you'll encounter as a data engineer. Basically, the goal is to develop a deep understanding of what the pros and cons of storing your data in different ways are. We're then going to focus on how to translate that high-level, ethereal concept into a more concrete understanding and really showcase how the same dataset can be accessed and viewed differently if you simply store it in a different fashion.
If you have any feedback relating to this course, please contact us at support@cloudacademy.com.
Learning Objectives
- Learn about different data sources and formats, and how to model your data
- Get acquainted with the common data formats (CSV, XML, and JSON) as well as specialized data formats
- Learn about databases and how to exchange data between applications
Intended Audience
This course is suited to anyone looking to gain a practical, hands-on understanding of data modeling and for those who might want to change how they're storing their data.
Prerequisites
To get the most out of this course, you should familiarize yourself with what CSV and JSON are, along with databases at a high level.
So to bring it all together, let's pull it into a real-world scenario that's maybe a little on the complex side, but it starts to show how these storage technologies can be put to work for real-world uses.
Imagine your company sells computer accessories and peripherals, and it also has an online store. Pretty classic. You can replace this with your own entity model if you've been following along, or just imagine whatever your company sells. Your analytics department has requested that the sales data be fed into the company's cloud data warehouse.
Now the online store uses a relational database, but the data warehouse goes beyond that and stores information from many systems beyond just the sales system. So the data warehouse has a couple of ways of getting information to it. Well, let's walk through how we can start to architect a system, and how these different components can interact with each other.
So your online store probably has an associated database to store sales information. This is a really strong use case for a relational database: very clear expected data inputs of who is buying what, when, why, and where, and maybe some inventory tracking. This then relates to a data warehouse. Now this goes beyond just what your comparatively simple online store's database has. It will store lots of information in lots of different formats, but very importantly, it's a central repository for information. So the question then becomes, how do these two link up? And the answer is an API, such as a REST API.
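To make the relational side of this scenario concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, and sample rows are all hypothetical, chosen only to illustrate the "who is buying what, when, and where" structure plus simple inventory tracking; a real store's schema would look different.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an assumption
# made for illustration, not the schema of any real store.
SCHEMA = """
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    region      TEXT NOT NULL              -- "where"
);
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,              -- "what"
    in_stock   INTEGER NOT NULL            -- simple inventory tracking
);
CREATE TABLE sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),  -- "who"
    product_id  INTEGER NOT NULL REFERENCES products(product_id),
    quantity    INTEGER NOT NULL,
    sold_at     TEXT NOT NULL              -- "when", as an ISO 8601 string
);
"""


def build_store_db():
    """Create an in-memory database with one made-up sale."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'EU')")
    conn.execute("INSERT INTO products VALUES (10, 'USB-C hub', 25)")
    conn.execute(
        "INSERT INTO sales VALUES (100, 1, 10, 2, '2024-01-15T09:30:00')"
    )
    conn.commit()
    return conn


def sales_report(conn):
    """Join the three tables into one row per sale."""
    query = """
        SELECT c.name, p.name, s.quantity, s.sold_at, c.region
        FROM sales AS s
        JOIN customers AS c ON c.customer_id = s.customer_id
        JOIN products  AS p ON p.product_id  = s.product_id
        ORDER BY s.sale_id
    """
    return conn.execute(query).fetchall()
```

Calling `sales_report(build_store_db())` returns one tuple per sale, with the customer, product, quantity, timestamp, and region pulled together from the separate tables.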
Many data warehouses will have a system for GET, PUT, and POST requests to be made in order to upload new information to them. Now, for those of you who have large data movement problems, we're talking tens, dozens, or hundreds of gigabytes, or into the terabyte range, we might need to go with some alternative technologies. But for those of you who are getting started on your data journey, this architecture of a relational database with some type of execution engine, such as a Lambda function, posting data to a REST API to store it in a larger data warehouse is extremely common and extremely supportable.
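The flow just described (read rows from the relational database, then POST them to the warehouse's REST API) might look roughly like the sketch below. It uses only the Python standard library. The endpoint URL, the `sales` table layout, and the `{"records": [...]}` payload shape are assumptions for illustration; a real warehouse defines its own API, authentication, and batching rules, and the same functions could be called from whatever execution engine you use.

```python
import json
import sqlite3
import urllib.request

# Hypothetical endpoint: replace with your warehouse's real ingestion URL.
WAREHOUSE_URL = "https://warehouse.example.com/api/sales"


def fetch_sales(conn):
    """Read sales rows and return them as a list of plain dictionaries."""
    conn.row_factory = sqlite3.Row  # lets each row be converted to a dict
    rows = conn.execute(
        "SELECT sale_id, product, quantity, sold_at FROM sales ORDER BY sale_id"
    ).fetchall()
    return [dict(row) for row in rows]


def build_request(url, records):
    """Wrap the records in a JSON body and prepare an HTTP POST request."""
    body = json.dumps({"records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def upload_sales(conn, url=WAREHOUSE_URL):
    """Send all sales rows to the warehouse and return the HTTP status code."""
    request = build_request(url, fetch_sales(conn))
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

Keeping `build_request` separate from `upload_sales` means the payload can be inspected or tested without making a network call. For the larger data volumes mentioned above, a single POST of every row would not be appropriate.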
If you wanna know more about this subject, we actually have some labs associated with this class that will really help you dig in, in a hands-on fashion, to actually play with, experiment with, and interface with databases and data storage formats. Also, if you're interested in your general data engineering development, or maybe even data science, pay attention to the learning path that this is part of, because there are more classes that directly tie into the whole data engineer and data professional experience.
As always, please send any of your feedback in, good or bad. It helps us to target this course content more accurately in the future. I look forward to instructing everyone on the next class, and I will see you then.
Lectures
Introduction - Data Sources and Formats - Modeling Your Data - CSV - XML - JSON - Specialized Data Formats - Databases - Exchanging Data Across Applications
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education on these complex technologies.