Amazon MSK and Kafka Under the Hood

This brief course covers the fundamentals of Amazon MSK, including what the service is, how it works, and how to provision an Amazon MSK cluster. You will also be guided through how Amazon MSK fits into a functional architecture.

If you have any feedback relating to this course, feel free to reach out to us at

Learning Objectives

  • Learn about the Amazon MSK service and how it works
  • Learn how to provision an MSK cluster
  • Understand how Amazon MSK fits into a functional architecture

Intended Audience

This course is suited to anyone with no previous knowledge of Amazon MSK who wants to learn more about the service, as well as those who are interested in taking the AWS Certified Data Analytics - Specialty (DAS-C01) certification.


Prerequisites

To get the most out of this course, you should have a basic general understanding of cloud computing, preferably with Amazon Web Services experience. It would also be beneficial to have some basic knowledge of streaming data services such as Amazon Kinesis and Apache Kafka.


Everything you need to know about Kafka boils down to three main ideas. You have producers, which create data, such as a website gathering user traffic information. You have topics, which receive that data and store it with extreme fault tolerance. And you have consumers, which can read that data in order and know that it was never changed or modified along the way.
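The three ideas above can be sketched with a toy in-memory model. This is only an illustration, not how Kafka is implemented: the `Topic` and `Consumer` classes and the `user-clicks` name are hypothetical, and a real topic is replicated across brokers rather than held in a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Toy stand-in for a single-partition topic: an append-only list."""
    name: str
    records: list = field(default_factory=list)

    def append(self, record: str) -> int:
        """Producer side: add a record to the end and return its offset."""
        self.records.append(record)
        return len(self.records) - 1

    def read_from(self, offset: int) -> list:
        """Consumer side: return a copy of every record at or after `offset`."""
        return list(self.records[offset:])


@dataclass
class Consumer:
    """Tracks its own position, so consumers do not interfere with each other."""
    topic: Topic
    offset: int = 0

    def poll(self) -> list:
        """Fetch everything written since the last poll, in write order."""
        batch = self.topic.read_from(self.offset)
        self.offset += len(batch)
        return batch


clicks = Topic("user-clicks")
clicks.append("page=/home")
clicks.append("page=/pricing")

analytics = Consumer(clicks)
first_batch = analytics.poll()   # both records, in the order they were written
clicks.append("page=/signup")
second_batch = analytics.poll()  # only the record added since the last poll
```

Because reading never removes anything from the topic, a second consumer created later would still see all three records from offset 0.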

Kafka is often used as a decoupling mechanism to help relieve tension among many different producers and consumers. For instance, you might have 10 websites, all creating log information that needs to be processed.

Let's say that you also have 20 microservices that each filter that data and make predictions about specific variables in it. If you were to hard-code a connection from every website to every microservice, you would have 10 × 20 = 200 separate connections that you need to worry about.
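The arithmetic behind that figure is worth spelling out. The sketch below uses two hypothetical helper functions; it assumes every website talks to every microservice in the direct case, and that each service holds exactly one logical connection to the intermediary in the brokered case.

```python
def point_to_point_links(producers: int, consumers: int) -> int:
    """Every producer connects directly to every consumer."""
    return producers * consumers


def brokered_links(producers: int, consumers: int) -> int:
    """Every producer and every consumer connects once to the intermediary."""
    return producers + consumers


websites = 10
microservices = 20

direct = point_to_point_links(websites, microservices)  # 10 * 20
via_kafka = brokered_links(websites, microservices)     # 10 + 20
print(f"direct: {direct}, via an intermediary: {via_kafka}")
```

Adding an eleventh website costs 20 new links in the direct design but only one in the brokered design.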

By using Kafka as an intermediary, all of that log information can be pushed into a single topic. This one topic is now the single source of truth for all of your microservices. They can each read through and gather the information they require on demand. The topic holds the producers' information until the retention period has elapsed. This window is configurable and defaults to seven days.

Kafka also has a size-based retention policy, where you configure the maximum amount of data that can be stored. Once that maximum has been reached, Kafka starts deleting the oldest data. Both of these options can be configured on a per-topic basis, which provides a lot of flexibility to keep data costs down or to retain high-value information for longer.
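A minimal sketch of how the two retention policies interact is shown below. `retention.ms` and `retention.bytes` are the Kafka topic-level setting names; everything else (the `apply_retention` helper, the 1 GiB value, the record-by-record deletion) is an assumption made for illustration. Real Kafka deletes whole log segments rather than individual records, and it enforces the size limit per partition.

```python
from collections import deque

# Seven days expressed in milliseconds, the unit used by retention.ms.
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000

# Example per-topic overrides; the size value here is purely illustrative.
topic_config = {
    "retention.ms": str(SEVEN_DAYS_MS),
    "retention.bytes": str(1024 ** 3),
}


def apply_retention(log, now_ms, retention_ms, retention_bytes):
    """Drop records from the old end of `log` until both limits are met.

    `log` is a deque of (timestamp_ms, payload_bytes) pairs, oldest first.
    A negative `retention_bytes` means "no size limit".
    Returns the number of records removed.
    """
    removed = 0
    # Time-based retention: drop records older than the window.
    while log and now_ms - log[0][0] > retention_ms:
        log.popleft()
        removed += 1
    # Size-based retention: drop the oldest records while over the cap.
    if retention_bytes >= 0:
        total = sum(len(payload) for _, payload in log)
        while log and total > retention_bytes:
            _, payload = log.popleft()
            total -= len(payload)
            removed += 1
    return removed


log = deque([(0, b"aaaa"), (1000, b"bbbb"), (2000, b"cccc")])
dropped = apply_retention(log, now_ms=2500, retention_ms=2000, retention_bytes=4)
```

In this run the first record is dropped for being too old and the second for pushing the log over the size cap, so only the newest record survives.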

Each topic has a number of partitions. Unless a partition key is provided, the data is spread across those partitions with no guarantee about which one a given record lands in; records that share a key are written to the same partition. Once data has been written to a topic, it can never be changed. You can provide an update to that data, but it would just be the next entry in the partition instead of overwriting the original data. The more partitions you have for a topic, the more parallelism you can have.
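Key-based placement and append-only updates can be sketched as follows. This is a toy model under stated assumptions: `choose_partition` and `send` are hypothetical helpers, the partition count of 3 is arbitrary, and CRC32 stands in for the hash that Kafka's own partitioner uses, so the partition numbers will not match a real cluster.

```python
import zlib

NUM_PARTITIONS = 3
# Each partition is its own append-only list of (key, value) entries.
partitions = [[] for _ in range(NUM_PARTITIONS)]


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition deterministically (CRC32 as a stand-in hash)."""
    return zlib.crc32(key) % num_partitions


def send(key: str, value: str):
    """Append a record to the partition chosen by its key.

    Returns (partition, offset) for the newly written entry.
    """
    partition = choose_partition(key.encode("utf-8"), NUM_PARTITIONS)
    partitions[partition].append((key, value))
    return partition, len(partitions[partition]) - 1


first = send("user-42", "plan=free")
update = send("user-42", "plan=pro")  # an "update" is just the next entry
```

Both writes land in the same partition because they share a key, and the original `plan=free` entry is still there after the update, one offset earlier.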

About the Author

William Meadows is a passionately curious human currently living in the Bay Area in California. His career has included working with lasers, teaching teenagers how to code, and creating classes about cloud technology that are taught all over the world. His dedication to completing goals and helping others is what brings meaning to his life. In his free time, he enjoys reading Reddit, playing video games, and writing books.