Data Lakes in AWS
This section of the Solutions Architect Associate learning path introduces you to the AWS database services relevant to the SAA-C03 exam. We then look at the service options available and learn how to select and apply AWS database services to meet specific design scenarios relevant to the Solutions Architect Associate exam.
- Understand the various database services that can be used when building cloud solutions on AWS
- Learn how to build databases using Amazon RDS, DynamoDB, Redshift, DocumentDB, Keyspaces, and QLDB
- Learn how to create ElastiCache and Neptune clusters
- Understand AWS database costs
- Learn about data lakes and how to build a data lake in AWS
What makes up a good data lake?
A good data lake will deal with these five challenges well: Storage (the lake itself), Data Movement (how the data gets to the lake), Data Cataloging and Discovery (finding the data and classifying it), Generic Analytics (making sense of that data), and Predictive Analytics (making educated guesses about the future based on the data).
Storage: Let's take a look at storage first. The reason people moved into using data lakes was that storage costs were becoming burdensome because the sheer volume of data was starting to crush people. What service does AWS offer that can easily deal with large and crushing volumes of raw data? Well, the first thing I would think about would be something like S3.
S3 is particularly good in this scenario, not only because it can deal with the large volume of data, but also because it can handle unstructured data. You could fill it with log files, JSON transaction documents, or blobs of binary output; it takes anything. A normal database would not be particularly suitable for this task.
The other benefit of using S3 is that we can set up lifecycle policies to help deal with the cost of the ever-increasing data burden. This allows us to put infrequently accessed data into a cheaper storage tier, and even to eventually put it into S3 Glacier (the deep archival storage class) when we are fairly certain that the data isn't going to be used for a long while. We can of course return the cold data back into S3 Standard if we ever need to.
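The lifecycle idea above can be sketched with boto3. This is a minimal sketch, not a recommended policy: the `raw/` prefix, the 30-day and 180-day thresholds, and the rule ID are assumptions chosen for illustration, and the bucket name is whatever bucket you use for your lake.

```python
def build_lifecycle_rules(prefix="raw/", ia_days=30, glacier_days=180):
    """Build an S3 lifecycle configuration that tiers objects down over time.

    The prefix and day thresholds are illustrative assumptions.
    """
    if glacier_days <= ia_days:
        raise ValueError("glacier_days must be greater than ia_days")
    return {
        "Rules": [
            {
                "ID": "tier-down-cold-data",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    # Infrequently accessed data moves to a cheaper tier first...
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    # ...and is archived to Glacier later.
                    {"Days": glacier_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle(bucket, config):
    """Attach the lifecycle configuration to a bucket (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```

Calling `apply_lifecycle("my-data-lake-bucket", build_lifecycle_rules())` would apply the rule to a bucket of that (hypothetical) name.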
Data Movement: Another important thing to figure out when building your data lake is how the heck you are planning to actually get your data into it. We know that S3 is what we should use for storage, but what mechanisms do we want to use to get all the stuff… into S3?
We could of course manually move large folders of archived log data into whatever bucket we are using for our data lake, but that idea is not super scalable and honestly just feels bad. It would be great to automatically push our business data into this bucket.
There are a few ways of getting your data into your bucket. You could actively stream your data with Kinesis, use a direct connection from on-premises to bring in large quantities of data, or use the AWS Database Migration Service to move your database information into S3. You might even have to have Snowball devices delivered to some faraway outpost once a month to collect research data and send it back to AWS.
Whatever your method, you will need a way to move your data into AWS, and you will prefer that whatever way you use is automated.
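As one example of automated movement, the streaming option can be sketched with boto3 and a Kinesis Data Firehose delivery stream. This assumes a delivery stream that is already configured to write to your data lake bucket; the stream name `example-lake-stream` and the event fields are hypothetical.

```python
import json


def encode_record(event):
    """Serialize one event as a newline-terminated JSON line (bytes).

    Newline-delimited JSON keeps records separable once they are
    batched together into objects in S3.
    """
    return (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")


def send_to_lake(event, stream_name="example-lake-stream"):
    """Push a single event to a Firehose delivery stream (needs AWS credentials).

    The stream name is a placeholder; the stream is assumed to have
    an S3 destination pointing at the data lake bucket.
    """
    import boto3  # imported lazily so encode_record works without boto3

    firehose = boto3.client("firehose")
    return firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_record(event)},
    )
```

An application would call `send_to_lake({"user": "abc", "action": "checkout"})` wherever the business event occurs, so nothing has to be copied by hand.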
Data Cataloging and Discovery: Once you have all of your data within your data lake (your S3 bucket of choice), it becomes necessary to start cataloging and understanding the types of data you have. If we do not spend at least a little time working through our data and managing it, we will quickly turn our data lake into a data swamp.
Think about what would happen as you add terabytes to petabytes of data, folders upon folders of the stuff, into the same bucket. As you do this over long periods of time, your knowledge of what is what and where it lives will fade. This makes it nearly impossible for anyone else to find the specific data sets they want to work with.
This is why we need to catalog our data. We need to create some data about the data - metadata. This will help future users discover what they need from our data lake, without having to spend hours, days, or weeks trying to figure out where or what it is.
Things that might be helpful to know include, for example, what formats the various data are stored in: is it mostly JSON, CSV, or Parquet? Is it compressed data? Is it sensitive data? You might also want to add additional tags, such as noting that the data came from Twitter or from customer reviews.
There are many ways you can go about creating your own data catalog. For example, you might have an upload event on your S3 bucket that triggers a Lambda function to store some metadata in DynamoDB about the new data that was just uploaded.
From there we could push that information into Elasticsearch to browse through and query that data. This is a very do-it-yourself approach and could be a little tricky to get set up correctly.
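The do-it-yourself catalog described above could look roughly like the following Lambda function. It is a sketch under assumptions: the DynamoDB table name (`example-data-catalog`, read from a hypothetical `CATALOG_TABLE` environment variable), the attribute names, and the use of the file extension as the "format" are all illustrative choices, not a prescribed schema.

```python
import os
from urllib.parse import unquote_plus


def extract_metadata(event):
    """Turn an S3 upload notification into a list of catalog items."""
    items = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3["object"]["key"])
        extension = os.path.splitext(key)[1].lstrip(".").lower()
        items.append(
            {
                "object_key": key,
                "bucket": s3["bucket"]["name"],
                "size_bytes": s3["object"].get("size", 0),
                "format": extension or "unknown",
                "uploaded_at": record.get("eventTime", "unknown"),
            }
        )
    return items


def handler(event, context):
    """Lambda entry point: store one metadata item per uploaded object."""
    import boto3  # imported lazily so extract_metadata can be tested locally

    table_name = os.environ.get("CATALOG_TABLE", "example-data-catalog")
    table = boto3.resource("dynamodb").Table(table_name)
    items = extract_metadata(event)
    for item in items:
        table.put_item(Item=item)
    return {"cataloged": len(items)}
```

Keeping the parsing in `extract_metadata` separate from the DynamoDB write makes the logic easy to check locally before wiring up the bucket notification.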
I would recommend instead that you take a look at AWS Glue. This service is a managed transform engine that allows you to run ETL pipelines - but for our uses, it also contains a very robust data catalog that we can leverage.
The Glue Data Catalog even comes with built-in crawlers that can crawl through various data sources and automatically populate the catalog for you. This includes your S3 buckets, databases, and data warehouses. They can be scheduled to run at certain times or triggered by events such as a new upload into that S3 bucket.
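A scheduled crawler over the data lake bucket might be set up along these lines with boto3. Treat it as a sketch: the crawler name, IAM role ARN, catalog database, S3 path, and the nightly cron expression are all placeholder assumptions.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build the arguments for Glue's create_crawler call.

    All values passed in are caller-supplied; nothing here is a default
    that AWS provides.
    """
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # catalog database the crawler populates
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:
        request["Schedule"] = schedule  # e.g. "cron(0 2 * * ? *)"
    return request


def create_crawler(request):
    """Create the crawler in AWS Glue (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
```

For instance, `build_crawler_request("lake-crawler", role_arn, "lake_catalog", "s3://my-data-lake-bucket/raw/", "cron(0 2 * * ? *)")` describes a crawler that would run at 02:00 UTC each day, where every name shown is hypothetical.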
Analytics: Why would we be collecting all this data if we did not want to learn something from it? Our data is a record of the past, and that record can give us great insights into what was successful and what was a failure for our business.
There are a number of great AWS services that can help you start to make sense of your data. These services range in their analytical ability and what their goals are.
For example, if you wanted to get some real-time information about your data lake, or at least about the information being streamed into it from Kinesis or Amazon MSK, you can use Kinesis Data Analytics to get a real-time feed of what your streaming data is up to.
If you were looking to interactively scrub through your data, we have Amazon Athena, a purpose-built service that makes it easy to analyze data in Amazon S3 using standard SQL.
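Querying the lake with Athena from code could be sketched like this. The database, the `reviews` table, and the results location are hypothetical; the table is assumed to already exist in the data catalog.

```python
def build_athena_query(database, sql, output_location):
    """Build the arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query and return its execution ID (needs AWS credentials).

    Athena runs queries asynchronously, so the caller would poll
    get_query_execution with this ID before reading results.
    """
    import boto3  # imported lazily so the builder works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical example: count catalogued reviews per source.
EXAMPLE_REQUEST = build_athena_query(
    database="lake_catalog",
    sql="SELECT source, COUNT(*) AS total FROM reviews GROUP BY source",
    output_location="s3://my-athena-results-bucket/queries/",
)
```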
If you have some section of your data that you want to create dashboards and graphs for, that's where something like Amazon QuickSight can be added to your solution.
We also have data warehousing services like Amazon Redshift, into which you can load a subset of your data lake to perform general analytics and try to derive some meaning from that data.
Predictive analytics: Being able to perform predictive analytics will allow you to gain some possible future insight into your business through your data. You can start to build out systems that help with this through the use of machine learning services.
One of the most important things for machine learning is having a robust data set to work with. This is why it works so well to have a data lake where you can pull subsets of data from.
Amazon offers Amazon SageMaker as a quick way to get into creating, training, and running your own models within AWS.
Additionally, AWS has a series of deep learning AMIs that come pre-configured with popular deep learning frameworks and interfaces. These include TensorFlow, PyTorch, Apache MXNet, Chainer, Gluon, Horovod, and Keras. There are no additional charges for using these AMIs; you still pay as you go for the instances they run on, just as with any other instance.
Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.
To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.
Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.
He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.
In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.
Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.