Understanding Data File Formats
The course is part of this learning path
This course explores various data file formats that are used for data analytics, big data, and machine learning. So this course is ideal for you if you're looking to understand which file type you should use for your big data or analytic pipelines and make a decision on which file type is right for your workload.
- Understand the pros and cons of Apache ORC, Apache Parquet, AVRO, CSV, and JSON file types
- Learn which data file format best suits your needs
This course is for anyone who wants to learn about data formats and file types, and which ones are right for their workloads.
To get the most out of this course, you should have some background knowledge of databases, data information systems, and data files.
There are a number of data file formats available for you to use in your big data, machine learning, analytics, or data lake workloads. Understanding which one to use can make a huge difference in regards to speed, data size, data safety, and overall ease of use.
The formats I want to go over today are popular in all of these spaces. Even if you are not looking to do any of those things, being able to talk about these file formats and their trade-offs will help your AWS career in general.
Today we are going to go over the pros and cons of: Apache ORC, Apache Parquet, AVRO, CSV, and JSON.
Each of these formats have their own pluses and minuses that we are going to have to take a look at. In general, the main three you should pay attention to are ORC, Parquet, and AVRO - the other two, CSV and JSON can be used - but are very basic formats and do not have a lot of quality of life features.
Let's just get started.
Your data has probably been collated from many disparate sources, and will initially exist in many different formats. This raw data will generally take up a lot of space, It will not be parallelizable, and it will be hard to move around in general.
To properly work with your data, you will need to start grouping it together and synthesize it into one cohesive type. This will help with compressing your data, making it chunkable, and overall will increase performance.
Now, Data has weight, and with weight comes a sort of inertia. That's why it is important to make an educated decision about what you will need now, and in the future, so you don't run into long-term problems down the road. Because once you have settled on a file format, it becomes increasingly more difficult to change from it as your data catalog grows.
You are probably asking, why do I even need to pick a file format for my data, when I could just normalize all of it, and store it within a database?
Depending on the amount of data you have, it is entirely possible to store all of it within a normal database, something like RDS for example. Keeping your data in hot storage, as this is called, gives you a lot of power over that data.
When your data is set up in this way, It becomes easily accessible, searchable, and able to have work performed on it. The downside of course is that it is very expensive and can be difficult to maintain. Imagine the cost alone of housing tens or hundreds of terabytes of data within a database.
So that leaves us with keeping our data within a well-maintained file, or series of files that we can load and swap into our solutions as needed.
Let’s go over some of the important questions and scenarios you need to know about in order to make this decision.
One of the most important choices to look at when selecting a data file format is making sure that it is supported by whatever service or system you are planning to use. Not all formats are acceptable by all applications.
You can write or use converting tools to help you transition data from one to another, but it is quite a painful process to be honest.
For this lecture we will be discussing data formats that are all available for hadoop, to make things easier, but if some other more specific technology comes along and needs a different format, that kind of excludes a lot of choices.
A big part of choosing the correct data format is understanding the difference between a columnar based format and a row based one. There is a large difference in the way the data is stored for both. This means that writing and reading that data comes at a cost depending on which one you have chosen.
Here is some sample data that can help to illustrate the difference between these two storage mechanisms.
In our data set we have 4 columns - Customer names, Customer IDs, the total amounts they have purchased, and the reward points they have earned.
When putting this information into a data format, you can input it row by row like the following:
Juan,1,200.20,20; Jim,2,.50,0 and so forth…
Or we can input it into the file column by column.
Juan, Jim, Jane, Jaipal; 1,2,3,4; yadda yadda.
Here is a high-level example of what the information looks like - but with colors.
You can see with a row based system, the groups of data get spread out all along the entire file. This means it takes the computer more time to find and read all of those specific data points.
With a column based system, all of the same data is grouped together, making it easier to gather an entire subset of data about a specific topic.
In general a column based format is most useful when you are building queries that only need a small selection of columns from your entire dataset. Instead of having only the 4 columns, let's add a couple more to illustrate the point.
Now imagine you wanted to see the relationship between customers, their rewards points, and if they liked dogs; we can pull all of that information with just three columns.
This means that when the computer is reading through the file, it can skip all the columns that are unnecessary. And when the reader does find a column it wants, it gets to read all that information in sequential order.
From there you could pare the data down to specific customers pretty easily if you wanted - or take the raw data and see what relationships you can find by performing some data analysis on it.
There are a lot of times when you want to examine much more than just three columns in a dataset at once. In fact, you might want to get data for EVERY column. This would be very common if you were looking up specific customers, and needed their entire account information.
Situations where you need to look up many fields of any particular entry are a good use of a row based file format. This becomes even more relevant if you want to get whole groups of entries at a time.
This trade-off between row vs column data storing is important and is something that will help to shape your decision when determining which file format is right for you.
With that in mind, let's take a look at what each of the formats we are looking over actually supports.
We can see here that CSV, JSON, and Avro all are row based file formats, while ORC and Parquet are Column based formats. I think this is a good first look into how to make a decision on what file format is best for you to use. There is no one perfect format for all situations, but I think this is one of the best deciding opportunities we are going to cover.
Is my workload more Read intensive or Write intensive?
This is a common question that is asked often when dealing with large quantities of data. Understanding how you are manipulating your data, can greatly increase your ability to handle it and move it around.
Asking This question is also very familiar to standard database workloads where you might choose one database solution over another based on its read or write performance.
And just like those databases you might choose from, these file types also have differences in this regard.
Since we just talked about Column vs row formatting I think we can start here. In general column-based data stores are built for read-heavy workloads, and row-based data stores are good for write-heavy workloads.
And again this is because with column-oriented - you can read a lot of data quickly when you are concerned about just a few elements; as you retrieve the entire column sequentially. Where the row-based formats would have to iterate through a lot of data to retrieve that exact same information.
However, this is reversed when you need to write a lot of transnational type data, where many elements are added all at once. The columnar store has to traverse through all the columns to write the data. While the row-based system can just add everything sequentially.
So this table does looks very similar to the previous one.
The one separator I will add between ORC and Parquet is that ORC is optimized for HIVE while parquet is optimized for Spark. So if you are using one of those for your data analytic needs - this should help break that tie.
How well does your file need to compress?
This problem can be very subtle at first when you are just getting started. You might not really care or even notice file sizes as you begin your journey into data analytics, big data, machine learning or the like. However, at some points, your files are going to start to get big. Like disgustingly big - or at least have that potential.
So its probably important to think about how bad a really big file would be for your setup. Are you budget concerned and need to keep things to a managed financial liability? Is your solution space limited and you are conscious about staying under a certain threshold?
Each of these file types have different ways they deal with compression. As we all know compression and decompression can add quite a lot of strain on the CPU and therefore makes things run slower. So that is another point to think about - if you are compressing the file, can your solution handle that slow down.
With these vague questions floating around, let's take some time to understand compression.
First off, we will always get better compression with a columnar format than with row format. This is because all of the data in a column is of the same type: for example all strings, or all ints, or all bools.
And the Compression is even better when these columns are sparsely populated - meaning they tend to be filled with null values indeed of actual data.
Row-based formats will generally have differing variable types throughout the row, so are not as easy to compress.
Besides the row vs column compression side of things, there is also the compression codec that plays a role in how well it performs.
Here is a short list of the most common compression codecs that you will find in the space.
Each of these CODECS offer varying degrees of compression and CPU costs.
You can use one of these compression algorithms with your choice of data format to customize for what you want. In general Orc and Parquet offer similar compression ratios when compared using the same compression codec. Avro, JSON, and CSV will be worse then both orc and parquet because of their row based lifestyles.
Does the data file need to be human readable?
This is a bit of an off case, but is still something to think about. If you need something to be both human and machine readable, there is normally a trade off in functionality.
For example: CSV and JSON are very human readable formats. You can pop open the file and find whatever data you want by hand ( more or less). But CSV data must be flat, it is not particularly efficient, and will not support nested data.
JSON is a touch better, as it can have nested data, but does become difficult for humans to read if there is a lot of it.
ORC, Parquet, and Avro all are filled with machine readable content. This means all of their data is stored in machine language, and is incomprehensible to humans. There is a small caveat with AVRO - as its headers are stored in plain text with the main information stored as binary.
Here is a chart comparing the readability our choices.
In general, this criteria is a little bit of a deal breaker in some ways: Having a solution that absolutely requires human readability from the file format itself, is not a good solution in my personal opinion.
However, it does come up. Generally these types of scenarios are not scalable and probably have a fundamental weakness if it requires humans to read the file at all.
Something to think about I suppose.
Does your file need to support nested data?
We just asked this question a moment ago in regards to readability, but it is also a consideration in general. Is your source data filled with nested content?
Do you need to have access to that nested data often, or just occasionally?
For example, Parquet would be a great choice if you have complicated nested data that you need to access frequently, because it has been built from the ground up to deal with complex nested data structures.
It stores its elements in a tree, and It uses the record shredding and assembly algorithm to help speed search - you can find more information about that here in the Dremel Paper: https://research.google/pubs/pub36632/
But basically, this algorithm lets you get the nested data and retrieve it from the sub-column without pulling the entire record.
This greatly increases retrieval times and allows you to deal with your complex data efficiently.
ORC and Avro can also deal with nested data, but they were not explicitly built with this kind of functionally in mind. They have other high-priority objectives like compression, or splittability as its most important feature.
And like we previously mentioned, CSV files are not going to work at all for this. And JSON is a yes, but it wasn't exactly purposely built for these types of jobs either.
In fact, as we go along our data file tour, I want you to start leaning away from JSON and CSV as they are the bare-bones basic file formats that don't really add a whole lot of quality of life.
Let’s take a look at the table here and then we can move on.
How well can the file be split?
At some point in your data analytics or big data-related life, you will make a file or will be in charge of one that has become heinously large. Your ability to break that file into smaller chunks can be the difference between success and failure.
Parallelization of your data processing can greatly improve performance, and that requires a file type that is open to being split.
When working with a distributed file system, like hadoop for example, it is very important that a file can be split into multiple pieces. Hadoop stores data in chunks, and when your file is splittable, hadoop can distribute chunks of your file around and operate on them in parallel.
If you have a filetype that is not splittable - you would be in charge of manually splitting your file into multiple other files to increase parallelism.
Our big three file formats: AVRO, Parquet, and ORC are all natively splittable and generally make things pretty easy in that regard.
CSV is splittable when it is in its raw form, not when it is compressed. So you would have to deal with huge uncompressed files to get this functionality.
JSON is splittable when using JSON lines. This is when each entry in the JSON file is followed by a new line character.
William Meadows is a passionately curious human currently living in the Bay Area in California. His career has included working with lasers, teaching teenagers how to code, and creating classes about cloud technology that are taught all over the world. His dedication to completing goals and helping others is what brings meaning to his life. In his free time, he enjoys reading Reddit, playing video games, and writing books.