Data Sources and Formats
Working With Data Sets
This course focuses on understanding common data formats and interfaces. It explores some common data formats that you'll encounter as a data engineer. The goal is to develop a deep understanding of the pros and cons of storing your data in different ways. We then focus on translating that high-level concept into a more concrete understanding, and showcase how the same dataset can be accessed and viewed differently simply by storing it in a different fashion.
If you have any feedback relating to this course, please contact us at firstname.lastname@example.org.
- Learn about different data sources and formats, and how to model your data
- Get acquainted with the common data formats (CSV, XML, and JSON) as well as specialized data formats
- Learn about databases and how to exchange data between applications
This course is suited to anyone looking to gain a practical, hands-on understanding of data modeling, and to those who might want to change how they're storing their data.
To get the most out of this course, you should be familiar with what CSV and JSON files are, and with databases at a high level.
So to dive right into the meat of the discussion rather than who I am and who we are, let's talk about the types of data sources and formats. The one most people start with is unstructured data. This is simply something like a plain text document, an email message, or in some ways, a Word document. These are considered unstructured because they don't inherently enforce a data schema or any other structure on the data. They give you a lot of flexibility because you can jam whatever content you want in there, but on the flip side, they're hard to process and parse because there's no standard, particularly if you're trading data between companies, departments, or even just with yourself at a later date.
Typically, when it comes to storing more than just freeform text, people then jump to semi-structured data. These are things like CSV, XML, and to an extent, JSON files. These don't necessarily have a predefined schema, although many of them have a way of defining one, but they do force everybody to put the data into a record or a row, and there's at least a common way to delineate and separate different bits of information. Especially with things like JSON, which is a little more on the structured side than CSV, you can really start to refer to data by field name or within a larger data structure. But very importantly, these semi-structured approaches still allow a great deal of flexibility in how you're storing the data.
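To make the CSV versus JSON point a little more concrete, here is a minimal sketch using Python's standard `csv` and `json` modules. The sample records and the field names (`name`, `city`, `visits`, `tags`) are made up for illustration and are not from the course materials.

```python
import csv
import io
import json

# Hypothetical sample data: the same two records stored two ways.
csv_text = "name,city,visits\nAda,London,3\nGrace,New York,5\n"
json_text = (
    '[{"name": "Ada", "city": "London", "visits": 3, "tags": ["vip"]},'
    ' {"name": "Grace", "city": "New York", "visits": 5}]'
)

# CSV: every record is a flat row, and every value is read back as text.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: values keep basic types, records can nest, and fields can be optional.
json_rows = json.loads(json_text)

print(type(csv_rows[0]["visits"]).__name__)   # str
print(type(json_rows[0]["visits"]).__name__)  # int
print(json_rows[0]["tags"])                   # nested list, only on one record
print("tags" in json_rows[1])                 # False: nothing enforces the field
```

Both formats let you address a value by field name, but neither one stops the second JSON record from omitting `tags`. That is the flexibility, and the lack of an enforced schema, described above.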
Finally, there are the structured data formats. Now these are things like relational databases, but this also extends to formats such as Avro, Parquet, or maybe if you're a Hadoop enthusiast, ORC. We'll discuss these a little bit later, but structured data sources, like you're seeing on the right here, are by far the most efficient and fastest to work with. And remember, when we say it's faster or more performant, that doesn't necessarily mean it's easier. In fact, it's often quite the opposite. The more performant ways of accessing data oftentimes require the most upfront setup, so it becomes a balance between flexibility and rapid use versus a planned, higher-performance approach.
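As a rough illustration of that upfront setup, the sketch below uses SQLite (through Python's built-in `sqlite3` module) as a stand-in for a relational database. The table name, columns, and rows are hypothetical. The schema has to be declared before any data goes in, and in exchange the database can reject incomplete records and answer aggregate queries directly.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# Upfront setup: the schema is declared before any data is stored.
conn.execute(
    """
    CREATE TABLE visits (
        name   TEXT    NOT NULL,
        city   TEXT    NOT NULL,
        visits INTEGER NOT NULL
    )
    """
)

conn.executemany(
    "INSERT INTO visits (name, city, visits) VALUES (?, ?, ?)",
    [("Ada", "London", 3), ("Grace", "New York", 5)],
)

# A record that is missing a required field is rejected by the schema.
try:
    conn.execute(
        "INSERT INTO visits (name, city) VALUES (?, ?)", ("Linus", "Helsinki")
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Once the data is structured, aggregate queries are a single statement.
total = conn.execute("SELECT SUM(visits) FROM visits").fetchone()[0]

print(rejected)  # True
print(total)     # 8
```

The trade-off is visible even at this scale: the `CREATE TABLE` step is extra work compared with writing a CSV or JSON file, but afterwards the malformed record never makes it into the dataset.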
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.