How do I Actually Build a Data Lake?



This section of the Solutions Architect Associate learning path introduces you to the AWS database services relevant to the SAA-C03 exam. We then look at the service options available and learn how to select and apply AWS database services to meet specific design scenarios relevant to the Solutions Architect Associate exam.


Learning Objectives

  • Understand the various database services that can be used when building cloud solutions on AWS
  • Learn how to build databases using Amazon RDS, DynamoDB, Redshift, DocumentDB, Keyspaces, and QLDB
  • Learn how to create ElastiCache and Neptune clusters
  • Understand AWS database costs
  • Learn about data lakes and how to build a data lake in AWS

Ok, so how do I actually build a data lake?

So there are two ways you can actually go about creating your data lake. You can try to assemble all of these interconnected data lake pieces by hand, which can take quite a bit of know-how and a lot of time.

There are also a few deployable templates from AWS that can help with this process, along with an architecture build guide.

Or we can use the AWS Lake Formation service, which promises to make setting up your secure data lake take only a matter of days, instead of weeks or months.

It does this by identifying existing data sources within Amazon S3, relational databases, and NoSQL databases that you want to move into your data lake. It will then crawl, catalog, and prepare all that data for you to perform analytics on. You can also target log files from things like CloudTrail, Kinesis Data Firehose, Elastic Load Balancers, and CloudFront. All this data can be grabbed all at once, or it can be taken incrementally.

All of this functionality is managed by using ‘blueprints’ where you simply:

  1. Point to the source data
  2. Point where you want to load that data in the data lake
  3. Specify how often you want to load that data

And the blueprint:

  1. Discovers the source's table schema
  2. Automatically converts the data to a new target format
  3. Partitions the data based on a partitioning schema
  4. Keeps track of the data that was already processed
  5. Allows you to customize all of the above
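
To make the blueprint steps above more concrete, here is a minimal Python sketch of the same idea: a configuration that names a source, a target, and a schedule, plus a loader that discovers a schema, partitions rows, and remembers what it has already processed. The class `BlueprintSketch`, its fields, and the `id` and `day` columns are all hypothetical names chosen for illustration. This is not the Lake Formation API and not how the service is implemented internally.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class BlueprintSketch:
    """Hypothetical stand-in for a blueprint; not the Lake Formation API."""
    source: str          # where the data comes from
    target: str          # where the data should land in the data lake
    schedule: str        # how often to load, kept here as an opaque string
    partition_key: str   # column used to partition the loaded rows
    bookmark: int = 0    # highest row id that has already been processed
    partitions: dict = field(default_factory=lambda: defaultdict(list))

    def discover_schema(self, rows):
        # Infer column names and Python type names from the first row.
        if not rows:
            return {}
        return {col: type(val).__name__ for col, val in rows[0].items()}

    def load(self, rows):
        # Incremental load: only rows newer than the bookmark are picked up.
        new_rows = [r for r in rows if r["id"] > self.bookmark]
        for row in new_rows:
            self.partitions[row[self.partition_key]].append(row)
        if new_rows:
            self.bookmark = max(r["id"] for r in new_rows)
        return len(new_rows)
```

Calling `load` twice on the same rows returns 0 the second time, because the bookmark already covers them. That is the "keeps track of the data that was already processed" behavior in miniature.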

AWS Lake Formation will take care of user security by creating self-service access to that data through your choice of analytics services.

It does this by setting up users' access within Lake Formation, tying data access to access control policies in the data catalog instead of to each individual service. So when a user comes to Lake Formation to see some data, their credentials and access roles are sent to Lake Formation, which determines what data that person is allowed to access and gives them a new token to carry with them that services like Athena, Redshift, and EMR will honor.

This allows you to define permissions once, and then open access to a range of managed services and have those permissions enforced.  
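
As a rough illustration of defining a permission once in the catalog, the sketch below builds a request for the `grant_permissions` operation of the boto3 `lakeformation` client. The helper names `build_grant_request` and `grant`, the default region, and any ARN, database, or table names you pass in are assumptions made for this example. Actually sending the request requires boto3, valid AWS credentials, and a principal that is allowed to grant Lake Formation permissions.

```python
def build_grant_request(principal_arn, database, table, permissions=("SELECT",)):
    """Build keyword arguments for the Lake Formation grant_permissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


def grant(request, region="us-east-1"):
    # boto3 is imported lazily so the request builder works without it installed.
    import boto3

    client = boto3.client("lakeformation", region_name=region)
    return client.grant_permissions(**request)
```

Keeping the request builder separate from the network call makes it easy to review or log exactly which principal receives which permission before anything is sent to AWS.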

There is no additional charge for using the Lake Formation service itself, but you do have to pay for all the services it uses. This means you have to pay for any AWS Glue usage during the crawling and cataloging phases. You will have to pay for storing the data in S3. You will also have to pay for any Athena queries you run against the data when looking up information.

So while the orchestration of all the services doesn't cost anything, there are many fees that you should be aware of when architecting your solutions if cost is a concern.
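
One way to keep those underlying charges in view is to total them per service. The sketch below does that arithmetic in Python. The function name and the rate keys are hypothetical, and the numbers in `PLACEHOLDER_RATES` are made-up placeholders, not actual AWS prices, so substitute the current rates for your region before relying on the result.

```python
# Placeholder unit prices for illustration only; these are NOT real AWS rates.
PLACEHOLDER_RATES = {
    "glue_per_dpu_hour": 0.50,
    "s3_per_gb_month": 0.02,
    "athena_per_tb_scanned": 5.00,
}


def estimate_monthly_cost(glue_dpu_hours, s3_gb_stored, athena_tb_scanned, rates):
    """Total the charges of the underlying services for one month."""
    costs = {
        "glue": glue_dpu_hours * rates["glue_per_dpu_hour"],
        "s3": s3_gb_stored * rates["s3_per_gb_month"],
        "athena": athena_tb_scanned * rates["athena_per_tb_scanned"],
        # The orchestration itself carries no separate charge in this model.
        "lake_formation": 0.0,
    }
    costs["total"] = sum(costs.values())
    return costs
```

For example, `estimate_monthly_cost(10, 500, 2, PLACEHOLDER_RATES)` breaks the bill down into Glue, S3, and Athena line items, which makes it easier to see which service dominates as data volume or query frequency grows.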

About the Author

Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.

To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.

Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.

He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.

In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.

Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.