How do I Actually Build a Data Lake?

Many organizations have implemented data lakes to great success, giving them a tactical business edge through the use of data analysis and predictive analytics.

This course covers the basics of data lakes, how they are different from data warehouses, and the components that make up a successful data lake.

Learning Objectives

  • Understand the difference between data warehouses and data lakes
  • Know what qualities make up a good data lake
  • Learn about AWS Lake Formation and how it can cut the time needed to create a data lake from months to days

Intended Audience

This course is intended for anyone who is responsible for managing business data or for those interested in creating a data lake in general.


Prerequisites

To get the most out of this course, you should have a decent understanding of cloud computing and cloud architectures, specifically with Amazon Web Services.


Ok, so how do I actually build a data lake?

There are two ways you can go about creating your data lake. You can try to assemble all of the interconnected data lake pieces by hand, which can take quite a bit of know-how and a lot of time.

There are also a few deployable templates floating around from AWS that can help with this process - take a look over here to see a template and an architecture build guide:

Or we can use the AWS Lake Formation service, which promises to make setting up a secure data lake take a matter of days instead of weeks or months.

It does this by identifying existing data sources within Amazon S3, relational databases, and NoSQL databases that you want to move into your data lake. It will then crawl, catalog, and prepare all that data for you to perform analytics on. You can also target log files from things like CloudTrail, Kinesis Data Firehose, Elastic Load Balancers, and CloudFront. All this data can be grabbed all at once, or it can be taken incrementally.
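To make the crawl-and-catalog step a little more concrete, here is a minimal sketch of the kind of request you might send to AWS Glue to crawl one S3 location on a schedule. The bucket path, role ARN, database name, and the helper build_crawler_request are all made up for illustration. The dictionary keys follow the parameters of the boto3 Glue create_crawler call, which is only shown in a comment so the snippet runs without AWS credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for a Glue crawler over a single S3 path.

    Hypothetical helper: every value passed in below is a placeholder,
    not something taken from this course.
    """
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Glue schedules use cron expressions such as "cron(0 2 * * ? *)".
        request["Schedule"] = schedule
    return request


request = build_crawler_request(
    name="sales-raw-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="example_lake_db",
    s3_path="s3://example-bucket/raw/sales/",
    schedule="cron(0 2 * * ? *)",
)

# With credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("glue").create_crawler(**request)
print(request["Targets"])
```

Leaving out the schedule argument gives you a crawler that only runs on demand, which is roughly the "all at once" case described above.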

All of this functionality is managed by using ‘blueprints’, where you simply:

  1. Point to the source data
  2. Point where you want to load that data in the data lake
  3. Specify how often you want to load that data

And the blueprint:

  1. Discovers the source table's schema
  2. Automatically converts the data to a new target format
  3. Partitions the data based on a partitioning scheme
  4. Keeps track of the data that was already processed
  5. Allows you to customize all of the above
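If it helps to picture what "partitioning" and "keeping track of what was already processed" mean in practice, here is a toy, in-memory sketch of an incremental load. It is not how Lake Formation implements blueprints. The function incremental_load, the row layout, and the idea of using an increasing id as the bookmark are all assumptions made for this example, and it only covers steps 3 and 4 of the list above.

```python
from collections import defaultdict
from datetime import date


def incremental_load(rows, bookmark, partition_key="order_date"):
    """Return (partitions, new_bookmark) for rows newer than the bookmark.

    rows: list of dicts, each with an increasing integer "id".
    bookmark: highest "id" already loaded, or None on the first run.
    """
    partitions = defaultdict(list)
    new_bookmark = bookmark
    for row in rows:
        if bookmark is not None and row["id"] <= bookmark:
            continue  # already processed on an earlier run
        # Group rows under a key=value partition name, e.g. order_date=2021-03-01.
        name = f"{partition_key}={row[partition_key].isoformat()}"
        partitions[name].append(row)
        if new_bookmark is None or row["id"] > new_bookmark:
            new_bookmark = row["id"]
    return dict(partitions), new_bookmark


source = [
    {"id": 1, "order_date": date(2021, 3, 1), "total": 20.0},
    {"id": 2, "order_date": date(2021, 3, 1), "total": 35.5},
    {"id": 3, "order_date": date(2021, 3, 2), "total": 12.0},
]

# First run sees two rows; the second run sees all three but skips the first two.
first_run, bookmark = incremental_load(source[:2], bookmark=None)
second_run, bookmark = incremental_load(source, bookmark=bookmark)
print(sorted(first_run))
print(sorted(second_run))
```

The second run only picks up the row that arrived after the first run, which is the whole point of the bookmark: the same source can be loaded on a schedule without reprocessing old data.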

AWS Lake Formation will take care of user security by creating self-service access to that data through your choice of analytics services.

It does this by managing users' access within Lake Formation itself, tying data access to access control policies in the data catalog instead of configuring it in each individual service. So when a user comes to Lake Formation to see some data, their credentials and access roles are sent to Lake Formation. Lake Formation evaluates them, determines what data that person is allowed to access, and gives them a new token to carry with them that services like Athena, Redshift, and EMR will honor.

This allows you to define permissions once, and then open access to a range of managed services and have those permissions enforced.  
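As a rough illustration of "define permissions once", here is a minimal sketch of a table-level grant. The principal ARN, database name, table name, and the helper build_table_grant are hypothetical. The dictionary shape follows the boto3 Lake Formation grant_permissions call, which is left in a comment so the snippet runs without an AWS account.

```python
def build_table_grant(principal_arn, database, table, permissions=("SELECT",)):
    """Build keyword arguments for a table-level Lake Formation grant.

    Hypothetical helper: the ARN and names used below are placeholders.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


grant = build_table_grant(
    principal_arn="arn:aws:iam::123456789012:role/ExampleAnalystRole",
    database="example_lake_db",
    table="sales",
)

# With credentials configured, the grant could be applied like this:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**grant)
print(grant["Permissions"])
```

The grant is attached to the catalog table rather than to any one query engine, which matches the idea above of setting a permission in one place and having the integrated services enforce it.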

There is no additional charge for using the Lake Formation service itself, but you do have to pay for all the services it uses. This means you pay for any AWS Glue usage during the crawling and cataloging phases, for storing the data in S3, and for any Athena queries you run against that data.

So while the orchestration of all the services doesn't cost anything, the underlying services do, and you should account for those fees when architecting your solution if cost is a concern.
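A quick back-of-the-envelope calculation can make those fees easier to reason about. The sketch below simply adds up the three charges mentioned above. The function estimate_monthly_cost is hypothetical, and the rates and usage numbers are placeholder figures for illustration only, so check the current AWS pricing pages for your region before relying on anything like this.

```python
def estimate_monthly_cost(
    crawler_dpu_hours,
    storage_gb,
    athena_tb_scanned,
    glue_rate_per_dpu_hour,
    s3_rate_per_gb_month,
    athena_rate_per_tb,
):
    """Sum the Glue, S3, and Athena charges for one month.

    All rates are supplied by the caller; none are looked up from AWS.
    """
    glue_cost = crawler_dpu_hours * glue_rate_per_dpu_hour
    s3_cost = storage_gb * s3_rate_per_gb_month
    athena_cost = athena_tb_scanned * athena_rate_per_tb
    return {
        "glue": glue_cost,
        "s3": s3_cost,
        "athena": athena_cost,
        "total": glue_cost + s3_cost + athena_cost,
    }


# Placeholder usage and rates, NOT quoted prices.
estimate = estimate_monthly_cost(
    crawler_dpu_hours=2,
    storage_gb=500,
    athena_tb_scanned=1.5,
    glue_rate_per_dpu_hour=0.44,
    s3_rate_per_gb_month=0.023,
    athena_rate_per_tb=5.0,
)
print(f"Estimated monthly total: ${estimate['total']:.2f}")
```

Keeping the rates as parameters makes it easy to rerun the estimate when prices or usage change.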

About the Author

William Meadows is a passionately curious human currently living in the Bay Area in California. His career has included working with lasers, teaching teenagers how to code, and creating classes about cloud technology that are taught all over the world. His dedication to completing goals and helping others is what brings meaning to his life. In his free time, he enjoys reading Reddit, playing video games, and writing books.