Working With Data Sets
This course focuses on understanding common data formats and interfaces. It explores the data formats you are likely to encounter as a data engineer. The goal is to develop a deep understanding of the pros and cons of storing your data in different ways. We then translate that high-level concept into a more concrete understanding and show how the same dataset can be accessed and viewed differently simply by storing it in a different fashion.
If you have any feedback relating to this course, please contact us at firstname.lastname@example.org.
- Learn about different data sources and formats, and how to model your data
- Get acquainted with the common data formats (CSV, XML, and JSON) as well as specialized data formats
- Learn about databases and how to exchange data between applications
This course is suited to anyone looking to gain a practical, hands-on understanding of data modeling and for those who might want to change how they're storing their data.
To get the most out of this course, you should be familiar with what CSV and JSON are, along with databases at a high level.
Underlying all of these, however, is the concept of a data model. So it's important to know what a data model is, what goes into it, and how to make one before you can start to think about which data format is best for you, or how to change your existing data format. So the first question to ask about a conceptual data model is: what are we really trying to keep track of?
Now this might seem a little simplistic, but it really does capture the essence of what we're after. Think, technology aside, from a purely logical view: what are you looking to keep track of, measure, or record? In data modeling terminology, we typically call this an entity, and around an entity are its attributes. These are characteristics or measurements that describe your entity.
Now an entity, as a concept floating in the ether, is a little abstract. So let's pull it back down to earth. Let's take a customer, for example. Most businesses, if not every business, have customers. This is your base-level entity. Now there are many attributes that describe the customer. Maybe it's a high-level attribute, such as whether it is an individual, a business, or a government entity. Perhaps there are things such as names. Perhaps there are things such as salutations. And maybe there are internal tracking values that describe this customer, such as a customer number or a last sale date, that you keep track of but that don't have much meaning outside of your organization. The point when building a data model is to identify the thing you're keeping track of and then decide how to describe it.
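To make the customer example concrete, here is a minimal sketch of one entity and its attributes in Python. The class name, field names, and sample values are illustrative assumptions, not something prescribed by the course; the attributes themselves are the ones mentioned above.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class CustomerType(Enum):
    """High-level attribute: what kind of customer this is."""
    INDIVIDUAL = "individual"
    BUSINESS = "business"
    GOVERNMENT = "government"


@dataclass
class Customer:
    """The entity. Each field is an attribute that describes it."""
    customer_number: str                    # internal tracking number
    customer_type: CustomerType
    name: str
    salutation: Optional[str] = None        # not every customer has one
    last_sale_date: Optional[date] = None   # only meaningful inside the organization


# Hypothetical sample record, made up for illustration.
alice = Customer(
    customer_number="C-0001",
    customer_type=CustomerType.INDIVIDUAL,
    name="Alice Example",
    salutation="Dr.",
    last_sale_date=date(2020, 1, 15),
)
```

Note that nothing here says how the customer is stored. The same entity could be written out as a CSV row, a JSON object, or a database record.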
So once you have your entity defined, and honestly in many cases there's more than one entity type, we need to think about how they relate to each other, that is, a relationship. Sticking with the sales application, maybe the customer is linked with a product. Now the product itself has its own complete list of attributes that describe it, and the relationship, in this case, might be a sale. In other words, when a customer purchases a product, a sale relationship is generated between the two of them. And don't think that there's only one type of relationship between these entities. Perhaps the customer has a product on their wishlist, or maybe they've purchased a product twice and that now needs to be represented with a different type of relationship. Or maybe a sale is its own entity that has transaction details and links to both the customer and the product. There are a lot of ways this can expand, but for simplicity, just think of a relationship as how two entities relate to each other.
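The relationships described above can be sketched the same way. This is one possible modeling under stated assumptions: the names Relationship, RelationshipType, and products_for are hypothetical, the sample data is made up, and a repeat purchase is represented here simply as a second sale record rather than as a separate relationship type.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class Customer:
    customer_number: str
    name: str


@dataclass(frozen=True)
class Product:
    sku: str
    title: str


class RelationshipType(Enum):
    SALE = "sale"
    WISHLIST = "wishlist"


@dataclass(frozen=True)
class Relationship:
    """Links two entities and records how they relate."""
    customer: Customer
    product: Product
    kind: RelationshipType


def products_for(customer, kind, relationships):
    """Return the products linked to a customer by one relationship type."""
    return [
        r.product
        for r in relationships
        if r.customer == customer and r.kind is kind
    ]


# Hypothetical sample data.
alice = Customer("C-0001", "Alice Example")
kettle = Product("P-100", "Kettle")
teapot = Product("P-200", "Teapot")

relationships = [
    Relationship(alice, kettle, RelationshipType.SALE),
    Relationship(alice, kettle, RelationshipType.SALE),      # purchased twice
    Relationship(alice, teapot, RelationshipType.WISHLIST),
]
```

If a sale needs its own transaction details, such as a date or a price, the alternative mentioned above applies: promote the sale to an entity of its own that references both the customer and the product.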
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.