Many organizations have implemented data lakes to great success, gaining a tactical business edge through data analysis and predictive analytics.
This course covers the basics of data lakes, how they are different from data warehouses, and the components that make up a successful data lake.
- Understand the difference between data warehouses and data lakes
- Know what qualities make up a good data lake
- Learn about AWS Lake Formation and how it can reduce the time needed to create a data lake from months to days
This course is intended for anyone responsible for managing business data, or for anyone interested in creating a data lake in general.
To get the most out of this course, you should have a decent understanding of cloud computing and cloud architectures, specifically with Amazon Web Services.
What makes up a good data lake?
A good data lake will deal with these five challenges well: Storage (the lake itself), Data Movement (how the data gets to the lake), Data Cataloging and Discovery (finding and classifying the data), General Analytics (making sense of that data), and Predictive Analytics (making educated guesses about the future based on the data).
Storage: Let's take a look at storage first. The reason people moved to using data lakes was that storage costs were becoming burdensome; the sheer volume of data was starting to crush traditional systems. What service does AWS offer that can easily deal with large and crushing volumes of raw data? Well, the first thing I would think of would be something like S3.
S3 is particularly good in this scenario, not only because it can deal with the large volume of data, but also because it can handle unstructured data. You could fill it with log files, JSON transaction documents, blobs of binary output; it takes anything. A normal database would not be particularly suitable for this task.
The other benefit of using S3 is that we can set up lifecycle policies to help deal with the cost of the ever-increasing data burden. These allow us to put infrequently accessed data into a cheaper storage tier, and eventually to move it into Glacier (the deep-archive service) when we are fairly certain that that data isn't going to be used for a long while. We can, of course, return the cold data to S3 Standard if we ever need to.
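To make the lifecycle idea concrete, here is a minimal sketch of the configuration you could attach to a lake bucket. The rule ID and the 30/365-day thresholds are illustrative assumptions, not recommendations from the course.

```python
import json

# Sketch of an S3 lifecycle configuration: move objects to the
# Infrequent Access tier after 30 days, then archive them to Glacier
# after a year. This dict is the shape you would pass to boto3's
# s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=...).
def build_lifecycle_config(ia_days=30, glacier_days=365):
    return {
        "Rules": [
            {
                "ID": "archive-old-lake-data",   # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},        # apply to every object
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }

print(json.dumps(build_lifecycle_config(), indent=2))
```

Restoring from Glacier later is a separate restore request against each archived object; the lifecycle rule only handles the trip into cold storage.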
Data Movement: Another important thing to figure out when building your data lake is how the heck you plan to actually get your data into it. We know that S3 is what we should use for storage, but what mechanisms do we want to use to get all the stuff… into S3?
We could of course manually move large folders of archived log data into whatever bucket we are using for our data lake, but that idea is not super scalable and honestly just feels bad. It would be great to automatically push our business data into this bucket.
There are a few ways of getting your data into your bucket: actively streaming it with Kinesis, using a direct connection from on premises to bring in large quantities of data, using the Database Migration Service to move your database contents into S3, or even having Snowball devices delivered to some faraway outpost once a month to collect research data to be sent back to AWS.
Whatever your method, you will need a way to move your data into AWS, and you will prefer that it be automated.
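As one example of an automated path, here is a sketch of the request you could pass to Kinesis Data Firehose's create_delivery_stream API (via boto3) so that streamed records land in the lake bucket without any manual copying. The stream name, role ARN, prefix, and buffering values are assumptions for illustration.

```python
import json

# Sketch of a Firehose delivery stream definition that buffers incoming
# records and writes them into the data-lake bucket automatically.
def build_firehose_request(bucket, role_arn):
    return {
        "DeliveryStreamName": "lake-ingest",     # hypothetical name
        "DeliveryStreamType": "DirectPut",
        "S3DestinationConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": f"arn:aws:s3:::{bucket}",
            "Prefix": "raw/",                    # keep raw arrivals in their own prefix
            # flush to S3 every 5 minutes or 5 MB, whichever comes first
            "BufferingHints": {"IntervalInSeconds": 300, "SizeInMBs": 5},
        },
    }

print(json.dumps(
    build_firehose_request("my-lake", "arn:aws:iam::123456789012:role/FirehoseRole"),
    indent=2))
```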
Data Cataloging and Discovery: Once you have all of your data within your data lake (your S3 bucket of choice), it becomes necessary to start cataloging and understanding the types of data you have. If we do not spend at least a little time working through and managing our data, we will quickly turn our data lake into a data swamp.
Think about what would happen as you add terabytes to petabytes of data, folders and folders of the stuff, into the same bucket. As you do this over a long period of time, your knowledge of what is what and where it lives will fade. This makes it nearly impossible for anyone else to find the specific data sets they want to work with.
This is why we need to catalog our data. We need to create some data about the data - metadata. This will help future users discover what they need from our data lake, without having to spend hours, days, or weeks trying to figure out where or what it is.
Things that might be helpful to know, for example: What formats are the various data stored in - is it mostly JSON, CSV, Parquet? Is it compressed? Is it sensitive data? And maybe you might want to add additional tags, like "this is data from Twitter" or "this is from customer reviews."
There are many ways you can go about creating your own data catalog. For example, you might have an upload event on your S3 bucket that triggers a Lambda function to store some metadata in DynamoDB about the new data that was just uploaded. From there we could push that information into Elasticsearch to browse through and query it. This is a very do-it-yourself approach and could be a little tricky to get set up correctly.
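The do-it-yourself approach above might look something like this Lambda handler sketch. The attribute layout and the crude format guess from the file extension are assumptions for illustration; the real write would be a boto3 dynamodb put_item call.

```python
import json

# Sketch of the DIY catalog: a Lambda handler fired by an S3 upload
# event that records basic metadata for each new object.
def build_metadata_item(event):
    record = event["Records"][0]
    key = record["s3"]["object"]["key"]
    return {
        "object_key": key,
        "bucket": record["s3"]["bucket"]["name"],
        "size_bytes": record["s3"]["object"]["size"],
        "format": key.rsplit(".", 1)[-1].lower(),  # crude type guess from extension
        "uploaded_at": record["eventTime"],
    }

def lambda_handler(event, context):
    item = build_metadata_item(event)
    # In a real deployment you would persist the item, e.g.:
    # boto3.resource("dynamodb").Table("lake-catalog").put_item(Item=item)
    print(json.dumps(item))
    return item
```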
I would recommend instead that you take a look at AWS Glue. This service is a managed transform engine that allows you to run ETL pipelines - but for our purposes, it also contains a very robust data catalog that we can leverage.
The Glue Data Catalog even contains built-in crawlers that can crawl through various data sources and automatically populate the catalog for you. This includes your S3 buckets, databases, and data warehouses. Crawlers can be scheduled to run at certain times or triggered by events like a new upload into that S3 bucket.
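Here is a sketch of the request you might pass to Glue's create_crawler API (via boto3) to crawl a lake bucket nightly and populate the Data Catalog. The crawler name, database name, role ARN, and schedule are illustrative assumptions.

```python
import json

# Sketch of a Glue crawler definition: scan the lake bucket once a day
# and write the discovered tables into a catalog database.
def build_crawler_request(bucket, database, role_arn):
    return {
        "Name": "lake-catalog-crawler",          # hypothetical name
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/"}]},
        # cron expression: run once a day at 02:00 UTC
        "Schedule": "cron(0 2 * * ? *)",
    }

print(json.dumps(
    build_crawler_request("my-lake", "lake_db",
                          "arn:aws:iam::123456789012:role/GlueCrawlerRole"),
    indent=2))
```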
Analytics: Why would we be collecting all this data if we did not want to learn anything from it? Our data is a record of the past, and that record can give us great insight into what was successful and what was a failure for our business.
There are a number of great AWS services that can help you start to make sense of your data. These services range in their analytical ability and what their goals are.
For example, if you wanted some real-time information about your data lake, or at least about the information being streamed into it from Kinesis or Amazon MSK, you can use Kinesis Data Analytics to get a real-time feed of what your streaming data is up to.
If you were looking to interactively scrub through your data, we have Amazon Athena, a purpose-built service that makes it easy to analyze data in Amazon S3 using standard SQL.
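As a sketch of what an interactive Athena query might look like, here is the request shape for Athena's start_query_execution API (via boto3). The table and column names are hypothetical examples of data you might have cataloged from the lake, and the output bucket is an assumption.

```python
# Sketch of an Athena query over a cataloged lake table: count low-star
# customer reviews per day. start_query_execution returns a
# QueryExecutionId that you poll for results.
def build_athena_request(database, output_bucket):
    sql = """
        SELECT review_date, COUNT(*) AS reviews
        FROM customer_reviews          -- hypothetical cataloged table
        WHERE star_rating <= 2
        GROUP BY review_date
        ORDER BY review_date
    """
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {
            # Athena writes result files here
            "OutputLocation": f"s3://{output_bucket}/athena-results/"
        },
    }

print(build_athena_request("lake_db", "my-lake")["QueryString"])
```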
If you have some section of your data that you want to create dashboards and graphs for, that's where something like Amazon Quicksight can be added to your solution.
And we also have data warehousing services like Redshift, into which you can place a subset of your data lake to perform general analytics and try to derive some meaning from that data.
Predictive Analytics: Being able to perform predictive analytics will allow you to gain some possible future insight into your business through your data. You can start to build out systems that help with this through the use of machine learning services.
One of the most important things for machine learning is having a robust data set to work with. This is why it works so well to have a data lake where you can pull subsets of data from.
Amazon offers SageMaker as a quick way to get into creating, training, and running your own models within AWS.
Additionally, AWS has a series of Deep Learning AMIs that come pre-configured with popular deep learning frameworks and interfaces, including TensorFlow, PyTorch, Apache MXNet, Chainer, Gluon, Horovod, and Keras. There are no additional charges for using these AMIs; they are still pay-as-you-go like other instance types.
William Meadows is a passionately curious human currently living in the Bay Area in California. His career has included working with lasers, teaching teenagers how to code, and creating classes about cloud technology that are taught all over the world. His dedication to completing goals and helping others is what brings meaning to his life. In his free time, he enjoys reading Reddit, playing video games, and writing books.