Working With Data Sets
This course focuses on understanding common data formats and interfaces. It explores some common data formats that you'll encounter as a data engineer. The goal is to develop a deep understanding of the pros and cons of storing your data in different ways. We then focus on translating that high-level concept into a more concrete understanding, showing how the same dataset can be accessed and viewed differently simply by storing it in a different fashion.
If you have any feedback relating to this course, please contact us at firstname.lastname@example.org.
- Learn about different data sources and formats, and how to model your data
- Get acquainted with the common data formats (CSV, XML, and JSON) as well as specialized data formats
- Learn about databases and how to exchange data between applications
This course is suited to anyone looking to gain a practical, hands-on understanding of data modeling and for those who might want to change how they're storing their data.
To get the most out of this course, you should be familiar with what CSV and JSON are, along with databases at a high level.
Now that we've discussed how to store your data and how to get your data, let's talk about how we're actually going to get it to people. Here is a typical scenario, or as typical a scenario as you can get in the world of data engineering. You have a lot of data stored in a number of different data sources, and this might include XML files, databases, CSVs, and so on. Several internal applications need access to all of that data. Some are interested in the customer data, others are interested in the sales data. Some people need data from both the XML files and the database. And the question becomes, how do we start to give the data to the people without creating specific point-to-point integrations, which are only good for a one-off serving of a specific use case?
This is where APIs come in. You might be thinking that you can just use SQL or another query language, but the complication is that if you have more than one type of data structure or data storage mechanism under the surface, you need to abstract it.
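To make the abstraction idea concrete, here is a minimal sketch in which one function hides two different storage mechanisms, an XML document and a SQL database. The sample data, the table layout, and the name `get_customer_summary` are all hypothetical and only illustrate the pattern; nothing here comes from the course's own systems.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical customer data stored as XML.
CUSTOMERS_XML = (
    "<customers>"
    "<customer id='1'><name>Ada</name></customer>"
    "<customer id='2'><name>Grace</name></customer>"
    "</customers>"
)


def make_sales_db() -> sqlite3.Connection:
    """Build a hypothetical in-memory sales database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [(1, 120.0), (1, 30.0), (2, 75.5)],
    )
    return conn


SALES_DB = make_sales_db()


def get_customer_summary(customer_id: int) -> dict:
    """Single entry point: callers never see which store each field came from."""
    # The customer name comes from the XML source.
    root = ET.fromstring(CUSTOMERS_XML)
    node = root.find(f"customer[@id='{customer_id}']")
    if node is None:
        raise KeyError(customer_id)
    # The sales total comes from the SQL source.
    (total,) = SALES_DB.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return {"id": customer_id, "name": node.findtext("name"), "total_sales": total}


print(get_customer_summary(1))
```

An application that needs both customer and sales data calls one function and gets one dictionary back, so the XML file could later be swapped for another store without the caller changing.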
An API is an application programming interface that can be set up to service a wide variety of underlying data storage technologies and formats. Think of an API kind of like the menu in a restaurant. The menu provides you with a list of dishes you can order, along with a description of each dish. Now, the restaurant is in no way going to tell you how they prepared the dish. They won't even let you in the kitchen. But what you do know is what you're getting, in a reliable, repeatable fashion. You order a steak, you're going to get a steak placed on your table. You order information from the API, and you're going to get a response containing the information you ordered. You don't know where that information came from, but you do know that you got high-quality data coming back to you.
Now, you can't talk about APIs without really starting with REST. There are a lot of other API styles out there, but especially over the last several years, REST has become extremely dominant. REST stands for representational state transfer.
Now honestly, most of the web runs on REST APIs. This is key because REST is highly flexible, widely accepted, and great for transferring secure, useful information. One of the main reasons REST is extremely valuable is that it's stateless. What this means is that each call is an independent transaction. So basically you, the end user, can access data in whatever pattern you want.
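As a rough illustration of statelessness, the sketch below answers each request using only the parameters that request carries. The `SALES` records and the `handle_get` function are hypothetical; the point is only that no call depends on an earlier one.

```python
# Hypothetical data set behind the API.
SALES = [
    {"region": "east", "amount": 100},
    {"region": "west", "amount": 250},
    {"region": "east", "amount": 50},
]


def handle_get(params: dict) -> dict:
    """Answer one request from its own parameters; nothing is remembered between calls."""
    region = params.get("region")
    rows = [r for r in SALES if region is None or r["region"] == region]
    return {"count": len(rows), "total": sum(r["amount"] for r in rows)}


# Calls can arrive in any order, and repeating a call gives the same answer.
first = handle_get({"region": "west"})
second = handle_get({"region": "east"})
third = handle_get({"region": "west"})
print(first, second, third)
```

Because the handler keeps no per-user session, the first and third calls return identical results regardless of what was requested in between.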
Now we could talk about authentication and how that's managed in cookies, but just know that if you're designing a REST API for your data, start to think in terms of how somebody is going to query it. How can each query stand completely on its own, depending on no other call? What are common queries people want to ask, and what are common responses you can serve them with? And even bigger than statelessness is the fact that REST works over URLs.
So what does that mean? Any web browser is using a REST API natively, so it's extremely common on the modern internet. Basically, when you go to a site such as Cloud Academy, all you're making is a series of REST calls to access and send data, authenticate yourself, and view the content. And that brings me to the last point I want to make about REST APIs, which is that they support a variety of different types of operations. This is important to understand when you're designing your data access layer.
Now, there are a lot of courses on Cloud Academy that discuss how to make REST APIs, but at this level, for a data engineer, think in terms of four commands. First, what resources do you want to get from the server? This is an HTTP GET, and in many cases that's what a web browser will default to when making a call.
There's also POST. This adds information to a collection: whereas a GET retrieves information, a POST sends information. A PUT modifies a property that's already in your data. And DELETE, like it sounds, deletes information from the data set. These four commands, wrapped in an interface that works over URLs, allow you to achieve most common data operations without having to look too far beyond them.
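The four commands can be sketched as a small dispatcher over an in-memory collection. This is an illustration under stated assumptions, not a real web framework: the `customers` store, the `handle` function, and the chosen status codes are hypothetical, and PUT merges fields here to follow the description above, whereas many APIs treat PUT as a full replacement and use PATCH for partial updates.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the data set behind the API.
customers = {1: {"name": "Ada"}}


def handle(method: str, customer_id: Optional[int] = None, body: Optional[dict] = None):
    """Map the four HTTP methods onto data operations; returns (status, payload)."""
    if method == "GET":
        # Retrieve the whole collection or a single record.
        if customer_id is None:
            return 200, customers
        if customer_id in customers:
            return 200, customers[customer_id]
        return 404, None
    if method == "POST":
        # Add a new record to the collection under the next free id.
        new_id = max(customers, default=0) + 1
        customers[new_id] = dict(body or {})
        return 201, {"id": new_id}
    if method == "PUT":
        # Modify a record that already exists (merging fields is an assumption).
        if customer_id not in customers:
            return 404, None
        customers[customer_id].update(body or {})
        return 200, customers[customer_id]
    if method == "DELETE":
        # Remove the record from the data set.
        if customers.pop(customer_id, None) is None:
            return 404, None
        return 204, None
    return 405, None  # any other method is not supported in this sketch


print(handle("POST", body={"name": "Grace"}))
print(handle("GET", 2))
```

In a real service each branch would sit behind a URL such as a hypothetical `/customers/2`, with the HTTP method selecting the operation.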
So let's take a brief look at a very simple example in which data is returned as JSON. You could specify a different format for the data response from a REST API, but JSON tends to be one of the most popular, if not the most popular. So here you have a service endpoint at the top. It's a GET request endpoint. And simply by typing that into your browser, you get a sample response. As you can see here, a Cloud Academy REST API example responds with very clean, easy-to-understand JSON. Now, if you were to try this yourself and enter this into your browser, you're going to get a 404, because you're not authenticated. This shows the power of the REST API to handle stateless transactions from different users and provide meaningful responses.
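On the client side, handling such a response might look roughly like the sketch below. The transcript does not show the actual endpoint or its fields, so `SAMPLE_BODY` and the `read_response` helper are hypothetical stand-ins.

```python
import json

# Hypothetical response body; the real example's fields are not shown in the transcript.
SAMPLE_BODY = '{"id": 1, "title": "Working With Data Sets", "formats": ["CSV", "XML", "JSON"]}'


def read_response(status: int, body: str) -> dict:
    """Parse the JSON body of a successful call; raise on any other status."""
    if status != 200:
        raise RuntimeError(f"request failed with HTTP status {status}")
    return json.loads(body)


course = read_response(200, SAMPLE_BODY)
print(course["title"], course["formats"])
```

A caller that receives an error status, such as the 404 mentioned above, gets an exception from `read_response` rather than a half-parsed result.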
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.