This course introduces the DP-203 Exam Preparation: Data Engineering on Microsoft Azure learning path, which explores the main subject areas covered in Microsoft's DP-203 exam:
- Designing and implementing data storage
- Designing and developing data processing
- Designing and implementing data security
- Monitoring and optimizing data storage and data processing
Hello and welcome to Data Engineering on Microsoft Azure. The focus of this learning path is to prepare you for Microsoft’s DP-203 exam. My name’s Guy Hummel and I’m a Microsoft Certified Azure Solutions Architect and Data Engineer.
The DP-203 exam tests your knowledge of four subject areas: designing and implementing data storage, designing and developing data processing, designing and implementing data security, and monitoring and optimizing data storage and data processing. I’m not going to talk about every item in the exam guide, but I’ll go over some of the highlights of what you’ll need to know.
The first, and biggest, section of the exam guide is about designing and implementing data storage. By far the most important service to know for this section, and for the entire exam, is Azure Synapse Analytics. This service allows you to store and analyze huge amounts of data. Microsoft is making it the central service in its data analytics product line.
An alternative analytics service is Azure Databricks, which is a collaboration between Microsoft and Databricks, the company founded by the people who created Apache Spark. Naturally, Azure Databricks uses Spark, but you can also use Spark on Azure Synapse Analytics, so the two services are effectively competitors.
Although both of these services have their own storage capabilities, it’s common to use another service called Azure Data Lake Storage Gen2 to store their data. It’s essentially a layer on top of Azure Blob Storage. This layer adds features that make it especially useful for big data processing.
For the first section of the exam guide, you need to know how to design and implement solutions that use one or more of these three services. First, you need to be able to design and implement physical and logical data storage structures, so you need to know things like what folder structure and file types to use and how to archive the data.
Next, to make queries as fast and efficient as possible, you need to know how to partition a data store, for example, by distributing its data across multiple shards.
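To make the sharding idea concrete, here is a minimal Python sketch of hash-based distribution, the same idea behind hash-distributed tables in Synapse Analytics. The shard count, column names, and hashing scheme are all invented for the example:

```python
# Illustrative sketch: assigning rows to shards by hashing a
# distribution column. Shard count and column names are made up.

NUM_SHARDS = 4

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a distribution-key value to a shard number deterministically."""
    # A stable hash keeps the same key on the same shard across runs.
    return sum(key.encode()) % num_shards

rows = [
    {"customer_id": "C100", "amount": 25},
    {"customer_id": "C101", "amount": 40},
    {"customer_id": "C100", "amount": 10},
]

shards: dict[int, list[dict]] = {}
for row in rows:
    shards.setdefault(shard_for(row["customer_id"]), []).append(row)
```

Because the hash is deterministic, all rows with the same key land on the same shard, which is what lets queries that join or group on that key avoid moving data between shards.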
Finally, you need to design and implement the serving layer. This includes database design topics, such as creating a star schema, handling slowly changing dimensions, and allowing for the retrieval of previous versions of data.
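As a taste of what handling slowly changing dimensions involves, here is a hedged Python sketch of the Type 2 pattern: rather than overwriting a changed attribute, you close out the old row and add a new current one, which is also what makes previous versions of the data retrievable. The table, column names, and helper function are invented for the example:

```python
from datetime import date

# Illustrative sketch of a Type 2 slowly changing dimension:
# a changed attribute expires the current row and inserts a new
# current row, preserving history. Column names are invented.

def apply_scd2(dimension: list[dict], key: str, new_city: str,
               change_date: date) -> None:
    """Expire the current row for `key` and insert a new current row."""
    for row in dimension:
        if row["customer_id"] == key and row["is_current"]:
            row["is_current"] = False
            row["end_date"] = change_date
            dimension.append({
                "customer_id": key,
                "city": new_city,
                "start_date": change_date,
                "end_date": None,
                "is_current": True,
            })
            return

dim_customer = [{
    "customer_id": "C100",
    "city": "Seattle",
    "start_date": date(2020, 1, 1),
    "end_date": None,
    "is_current": True,
}]

apply_scd2(dim_customer, "C100", "Portland", date(2023, 6, 1))
```

After the change, the dimension holds both the expired Seattle row and the current Portland row, so a query can reconstruct what the customer record looked like at any point in time.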
The next section of the exam guide is about designing and developing data processing solutions. This includes ingesting and transforming data, batch processing, stream processing, and managing batches and pipelines.
You need to know how to transform data using both Spark and SQL, and you need to be able to use the transformation features of Data Factory, Azure Synapse Pipelines, and Stream Analytics.
Data Factory is a data integration service. It lets you build pipelines that take data from one service, such as Azure Storage, transform it, and store it in another service, such as Synapse Analytics.
Azure Synapse Pipelines is essentially a subset of Data Factory that has been bolted onto Azure Synapse Analytics. So, if you know how to use Data Factory, you’ll know how to use Synapse Pipelines, too. They support both Spark and SQL-based transformations.
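The pipelines these services build follow an extract-transform-load shape: activities run in dependency order, copying data from a source, transforming it, and writing it to a sink. Here is a minimal Python sketch of that pattern; the function names and in-memory "stores" are invented, whereas a real pipeline would reference linked services such as Azure Storage and Synapse Analytics:

```python
# Illustrative sketch of the extract-transform-load pattern that a
# Data Factory or Synapse pipeline expresses as linked activities.
# The stores and functions here are stand-ins, not the real API.

source_store = [{"name": "widget", "price": "4.50"}]
sink_store: list[dict] = []

def extract(store):
    """Read rows from the source dataset."""
    return list(store)

def transform(rows):
    """Apply a simple typed transformation to each row."""
    return [{"name": r["name"], "price": float(r["price"])} for r in rows]

def load(rows, sink):
    """Write transformed rows to the sink dataset."""
    sink.extend(rows)

# Run the "pipeline": activities execute in dependency order.
load(transform(extract(source_store)), sink_store)
```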
Azure Stream Analytics is an older service that uses a SQL-like query language.
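A signature operation in that query language is windowed aggregation over a stream, such as grouping events into fixed-size tumbling windows. Here is a Python sketch of what a tumbling-window count does conceptually; the event fields and window size are invented for the example:

```python
# Illustrative sketch of a tumbling-window aggregation, the kind of
# operation a Stream Analytics query expresses in its SQL-like
# language. Event fields and the window size are made up.

WINDOW_SECONDS = 10

events = [
    {"sensor": "s1", "timestamp": 3,  "reading": 20},
    {"sensor": "s1", "timestamp": 8,  "reading": 22},
    {"sensor": "s1", "timestamp": 14, "reading": 30},
]

def tumbling_window_counts(events, window_seconds=WINDOW_SECONDS):
    """Count events per non-overlapping, fixed-size time window."""
    counts: dict[int, int] = {}
    for event in events:
        window_start = (event["timestamp"] // window_seconds) * window_seconds
        counts[window_start] = counts.get(window_start, 0) + 1
    return counts
```

Each event belongs to exactly one window, so the first two events fall in the window starting at 0 and the third in the window starting at 10.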
For batch processing, you need to know how to build solutions using four different services: Data Factory, Azure Synapse Pipelines, Azure Databricks, and Data Lake Storage Gen2.
For stream processing, you need to be able to use Stream Analytics (a service that’s focused entirely on stream processing), Azure Databricks, and Azure Event Hubs, which is a service that can ingest massive amounts of streaming data and pass it on to either Stream Analytics or Azure Databricks.
Finally, you need to be able to manage batches and pipelines using Data Factory or Synapse Pipelines, which again, are very similar to each other.
The next section of the exam guide is about how to design and implement data security. This is not just about restricting access to data, which is very important, of course, but also protecting sensitive information using techniques such as applying data masking to credit card numbers or encrypting an entire database.
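To make the masking idea concrete, here is a minimal Python sketch of credit-card masking, the concept behind dynamic data masking: non-privileged users see only the last four digits. The exact masking format here is an assumption for illustration, not Azure's implementation:

```python
# Illustrative sketch of credit-card masking. The output format is
# an assumption for the example, not Azure's actual masking rule.

def mask_card_number(card_number: str) -> str:
    """Replace all but the last four digits with 'X'."""
    digits = card_number.replace("-", "").replace(" ", "")
    return "XXXX-XXXX-XXXX-" + digits[-4:]
```

The point of masking, as opposed to encryption, is that the data stays readable enough to be useful (for instance, to confirm which card was used) without exposing the sensitive value itself.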
The final section of the exam guide is about monitoring and optimizing data storage and data processing. The most important service for the monitoring part of this section is, not surprisingly, Azure Monitor, which you can use to monitor and configure alerts for almost every other Azure service.
The optimization subsection doesn’t include new services. Instead, you need to know how to optimize the performance of services like Synapse Analytics, Data Factory, and Azure Databricks.
Please note that this learning path doesn’t cover the topics in the exact same order as the exam guide because there’s so much overlap between the different sections. For example, Azure Databricks is in every section of the exam guide, but we don’t have separate Azure Databricks courses for every section.
Now, are you ready to learn about Azure data engineering? Then let’s get started! To get to the next course in this learning path, click on the Learning Path pullout menu on the left side of the page. But please remember to rate this introduction before you go on to the next course. Thanks!
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).