How do I Actually Build a Data Lake?



This section of the AWS Certified Solutions Architect - Professional learning path introduces you to the AWS database services relevant to the SAP-C02 exam. It then explores the service options available and explains how to select and apply AWS database services to meet specific design scenarios relevant to the AWS Certified Solutions Architect - Professional exam.


Learning Objectives

  • Understand the various database services that can be used when building cloud solutions on AWS
  • Learn how to build databases using Amazon RDS, DynamoDB, Redshift, DocumentDB, Keyspaces, and QLDB
  • Learn how to create ElastiCache and Neptune clusters
  • Understand which AWS database service to choose based on your requirements
  • Discover how to use automation to deploy databases in AWS
  • Learn about data lakes and how to build a data lake in AWS

Ok, so how do I actually build a data lake?

So there are two ways you can actually go about creating your data lake. You can try to assemble all of these interconnected data lake pieces by hand, which takes quite a bit of know-how and a lot of time.

There are also a few deployable templates available from AWS that can help with this process - take a look over here to see a template and an architecture build guide.

Or we can use the AWS Lake Formation service, which promises to make setting up your secure data lake a matter of days instead of weeks or months.

It does this by identifying existing data sources within Amazon S3, relational databases, and NoSQL databases that you want to move into your data lake. It then crawls, catalogs, and prepares all that data for you to perform analytics on. You can also target log files from services like CloudTrail, Kinesis Data Firehose, Elastic Load Balancing, and CloudFront. All this data can be ingested all at once, or it can be taken incrementally.
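As a rough sketch of the crawl-and-catalog step, here is how you might register an S3 source with AWS Glue, the service Lake Formation drives under the hood. The crawler name, IAM role, database, and bucket path below are placeholders, not values from the course; the request shape is built by a plain function so it can be inspected without AWS credentials.

```python
def build_crawler_params(name, role_arn, database, s3_path):
    """Build the request parameters for glue.create_crawler()."""
    return {
        "Name": name,
        "Role": role_arn,                 # IAM role Glue assumes to read the source
        "DatabaseName": database,         # Data Catalog database to hold the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "SchemaChangePolicy": {           # how to react when the source schema drifts
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }

def create_and_start_crawler(params):
    """Create and start the crawler (requires AWS credentials)."""
    import boto3                          # deferred import; the builder stays testable offline
    glue = boto3.client("glue")
    glue.create_crawler(**params)
    glue.start_crawler(Name=params["Name"])

params = build_crawler_params(
    "sales-logs-crawler",                              # hypothetical names
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "datalake_raw",
    "s3://example-datalake-raw/sales/",
)
```

Once the crawler finishes, the discovered tables appear in the Glue Data Catalog, which is where Lake Formation later attaches its permissions.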

All of this functionality is managed using ‘blueprints’, where you simply:

  1. Point to the source data
  2. Point where you want to load that data in the data lake
  3. Specify how often you want to load that data

And the blueprint:

  1. Discovers the source table's schema
  2. Automatically converts the data to a new target format
  3. Partitions the data based on your partitioning schema
  4. Keeps track of data that has already been processed
  5. Allows you to customize all of the above
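The "keeps track of what was already processed" step is the heart of incremental loading. The snippet below is an illustrative stand-in for that bookkeeping, not Lake Formation's actual mechanism: remember which partitions a previous run ingested, and only load the new ones.

```python
def plan_incremental_load(available_partitions, already_processed):
    """Return only the partitions that still need loading, in sorted order."""
    return sorted(set(available_partitions) - set(already_processed))

# State from the previous run (hypothetical date-based partitions).
processed = {"dt=2023-01-01", "dt=2023-01-02"}

# Partitions currently visible in the source.
available = ["dt=2023-01-01", "dt=2023-01-02", "dt=2023-01-03"]

to_load = plan_incremental_load(available, processed)  # only the new partition
processed.update(to_load)                              # record it for the next run
```

A blueprint on a schedule does essentially this for you: each run re-scans the source, skips what was already loaded, and ingests the rest.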

AWS Lake Formation takes care of user security by creating self-service access to that data through your choice of analytics services.

It does this by setting up users' access within Lake Formation, tying data access to access control policies in the data catalog instead of in each individual service. So when a user comes to Lake Formation to see some data, their credentials and access roles are sent to Lake Formation, which determines what data that person is allowed to access and issues them a token that services like Athena, Redshift, and EMR will honor.

This allows you to define permissions once, and then open access to a range of managed services and have those permissions enforced.  
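A minimal sketch of "define permissions once" using the boto3 `lakeformation.grant_permissions` call. The principal ARN, database, and table names are placeholders; the request is built by a plain function so its shape is visible without AWS credentials.

```python
def build_grant(principal_arn, database, table, permissions):
    """Build the request parameters for lakeformation.grant_permissions()."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": permissions,   # e.g. SELECT, then honored by Athena, Redshift, EMR
    }

def grant(params):
    """Apply the grant (requires AWS credentials)."""
    import boto3                      # deferred import; the builder stays testable offline
    boto3.client("lakeformation").grant_permissions(**params)

grant_params = build_grant(
    "arn:aws:iam::123456789012:role/AnalystRole",  # hypothetical analyst role
    "datalake_curated",
    "sales",
    ["SELECT"],
)
```

The key design point is that this grant lives in the data catalog, so any integrated analytics service the analyst queries through enforces the same permission.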

There is no additional charge for using the Lake Formation service itself, but you do have to pay for the services it uses. This means you pay for any AWS Glue usage during the crawling and cataloging phases, for the data stored in S3, and for any Athena queries you run against the data.

So while the orchestration of all the services doesn't cost anything, there are many fees that you should be aware of when architecting your solutions if cost is a concern.
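A back-of-the-envelope sketch of those underlying fees. The rates below are illustrative defaults only; check the current AWS pricing pages before relying on them. Lake Formation itself adds no line item, so the total is just the sum of the services it orchestrates.

```python
def estimate_monthly_cost(s3_gb, athena_tb_scanned, glue_dpu_hours,
                          s3_rate=0.023,      # USD per GB-month (assumed S3 Standard rate)
                          athena_rate=5.00,   # USD per TB scanned (assumed Athena rate)
                          glue_rate=0.44):    # USD per DPU-hour (assumed Glue rate)
    """Sum the S3 storage, Athena query, and Glue crawl/ETL charges in USD."""
    return round(s3_gb * s3_rate
                 + athena_tb_scanned * athena_rate
                 + glue_dpu_hours * glue_rate, 2)

# Hypothetical small data lake: 500 GB stored, 2 TB queried, 10 DPU-hours of crawling.
cost = estimate_monthly_cost(s3_gb=500, athena_tb_scanned=2, glue_dpu_hours=10)
```

Even at this small scale the storage and query charges dominate, which is why partitioning and columnar target formats (which cut the data Athena scans) matter when cost is a concern.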

About the Author

Danny has over 20 years of IT experience as a software developer, cloud engineer, and technical trainer. After attending a conference on cloud computing in 2009, he knew he wanted to build his career around what was still a very new, emerging technology at the time — and share this transformational knowledge with others. He has spoken to IT professional audiences at local, regional, and national user groups and conferences. He has delivered in-person classroom and virtual training, interactive webinars, and authored video training courses covering many different technologies, including Amazon Web Services. He currently has six active AWS certifications, including certifications at the Professional and Specialty level.