Working With Data Sets
This course focuses on understanding common data formats and interfaces. It explores some common data formats that you'll encounter as a data engineer. Basically, the goal is to develop a deep understanding of the pros and cons of storing your data in different ways. We're then going to focus on how to translate that high-level, ethereal concept into a more concrete understanding and really showcase how the same dataset can be accessed and viewed differently if you simply store it in a different fashion.
If you have any feedback relating to this course, please contact us at firstname.lastname@example.org.
- Learn about different data sources and formats, and how to model your data
- Get acquainted with the common data formats (CSV, XML, and JSON) as well as specialized data formats
- Learn about databases and how to exchange data between applications
This course is suited to anyone looking to gain a practical, hands-on understanding of data modeling, and to those who might want to change how they're storing their data.
To get the most out of this course, you should be familiar with what CSV and JSON are, along with databases at a high level.
And although the details are out of scope for this class, it's worth mentioning some of the more specialized data formats. The first one to look at is Apache Avro. Now this is a structured data format that attempts to address some of the gaps between the semi-structured data that we've looked at, like JSON, and a full database. In the case of Apache Avro, the data is defined by a schema, which, confusingly enough, is written in JSON. And then the data itself is compressed and stored in a binary format.
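To make the split between a JSON schema and binary data a little more concrete, here is a minimal sketch in Python using only the standard library. The `User` schema and the sample record are made up for illustration, and the hand-written encoder only handles the `string` and `int` types. It follows the zigzag, variable-length style of integer encoding that Avro uses, but it skips compression, the file header, and every other type, so treat it as a teaching aid and not as a replacement for a real Avro library.

```python
import json

# Hypothetical schema for illustration. Avro schemas are written in JSON.
schema = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
""")

def encode_long(n):
    """Zigzag the sign into the low bit, then emit 7 bits per byte."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s):
    """Length prefix (encoded like a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

# Only the two types used by the sample schema are supported here.
ENCODERS = {"int": encode_long, "string": encode_string}

def encode_record(schema, record):
    """Write field values in schema order. Field names are never written."""
    parts = []
    for field in schema["fields"]:
        encoder = ENCODERS[field["type"]]
        parts.append(encoder(record[field["name"]]))
    return b"".join(parts)

user = {"name": "Ada", "age": 30}
binary = encode_record(schema, user)
as_json = json.dumps(user).encode("utf-8")

print("binary bytes:", len(binary))
print("JSON bytes:  ", len(as_json))
```

The binary form is smaller mainly because the field names live in the schema and are not repeated with every record. The flip side is that the bytes are meaningless without that schema.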
Now this is an extremely efficient and fast way of handling and processing data programmatically. And it might feel like you're getting the best of both worlds. The downside is that you need to do a lot more planning.
First you need to actually build a schema and make sure it validates, and then you have to go through and actually use an Apache Avro editor. Typically, Avro files are a lot harder to simply view and interact with than a human-readable format, such as CSV or XML.
And finally, we have Apache Parquet and Apache ORC, sometimes just called ORC. These are columnar storage formats. You might've seen these, particularly if you've worked with Hadoop or some of the products in that ecosystem, such as Spark, which really isn't Hadoop, but that's a discussion for another class. Basically, these focus on storing data in columns.
Now, when you only need a few of the columns, this drastically reduces the amount of disk I/O and the amount of data being loaded from disk, but just know that these, again like Avro, require quite a bit more setup up front. That being said, if you are taking this class with the intent of going into data science, I would recommend looking into Parquet a little bit more, because it's extraordinarily common with the Spark framework and a few other machine learning frameworks.
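To see why storing data in columns cuts down on what gets loaded, here is a toy comparison in plain Python. It is not the Parquet or ORC file format. The sample rows and the helper names are invented for illustration, and "values loaded" is just a counter standing in for disk reads.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# Plain Python only; this does not read or write Parquet or ORC files.

rows = [
    {"id": 1, "city": "Austin", "temp_f": 91},
    {"id": 2, "city": "Boston", "temp_f": 68},
    {"id": 3, "city": "Denver", "temp_f": 75},
]

def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns

def mean_from_rows(records, field):
    """Row layout: every record is loaded in full to reach one field."""
    values_loaded = sum(len(record) for record in records)
    total = sum(record[field] for record in records)
    return total / len(records), values_loaded

def mean_from_columns(columns, field):
    """Column layout: only the requested column is loaded."""
    column = columns[field]
    return sum(column) / len(column), len(column)

columns = to_columns(rows)
row_mean, row_loaded = mean_from_rows(rows, "temp_f")
col_mean, col_loaded = mean_from_columns(columns, "temp_f")

print("row layout:   ", row_mean, "values loaded:", row_loaded)
print("column layout:", col_mean, "values loaded:", col_loaded)
```

Both layouts give the same average, but the row layout touches every field of every record while the column layout touches only the one column it needs. With wide tables and millions of rows, that gap is where the savings come from.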
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.