Azure Data Implementation
The course is part of these learning pathsSee 3 more
Microsoft Azure offers services for a wide variety of data-related needs, including ones you would expect like file storage and relational databases, but also more specialized services, such as for text searching and time-series data. In this course, you will learn how to design a data implementation using the appropriate Azure services. Two services that are especially important are Azure SQL Database and Azure Cosmos DB.
Azure SQL Database is a managed service for hosting SQL Server databases (although it’s not 100% compatible with SQL Server). Even though Microsoft takes care of the maintenance, you still need to choose the right options to scale it and make sure it can survive failures.
Azure Cosmos DB is the first multi-model database that’s offered as a global cloud service. It can store and query documents, NoSQL tables, graphs, and columnar data. To get the most out of Cosmos DB, you need to know which consistency and performance guarantees to choose, as well as how to make it globally reliable.
Identify the most appropriate Azure services for various data-related needs
Design an Azure SQL Database implementation for scalability, availability, and disaster recovery
Design an Azure Cosmos DB implementation for cost, performance, consistency, availability, and business continuity
People who want to become Azure cloud architects
People preparing for a Microsoft Azure certification exam
General knowledge of IT architecture, especially databases
SQL Server has been one of the most popular relational databases for a very long time, so it’s not surprising that Microsoft has made it available as a managed Azure service as well. But Azure supports other relational databases too. Microsoft offers a surprisingly high level of support for the most popular open-source relational databases, MySQL and PostgreSQL.
If you have existing databases running on either of these systems and you want to migrate them to Azure, then you should use Azure Database for MySQL and Azure Database for PostgreSQL. These are managed services that give you lots of extra goodies, including high availability, backups, security, and compliance. There’s even a free service that can migrate your existing databases to Azure with minimal downtime. The managed service price is quite reasonable, so there isn’t much reason to run these databases directly on virtual machines, because you’d have to manage them yourself.
OK, back to SQL Server. The most obvious way to migrate your SQL Server databases is to use Azure SQL Database. However, if you’re currently running SQL Server 2016 or higher, there’s another option you may want to consider. It’s called SQL Server Stretch Database.
It’s very similar to StorSimple, except it’s for database records instead of files. It migrates cold table rows (that is, infrequently queried rows) to Azure. This happens in the background automatically. And you can still query the data that has migrated to Azure too. The only difference is that the query will take a bit longer than usual.
Stretch Database is a good solution if your on-site data keeps growing. First, instead of buying more storage to handle the growth, you can just migrate your cold data to Azure, which is usually cheaper and easier. Second, as your data grows, backups take longer to run. Eventually, your backups may not be able to run within your backup window. By migrating your cold data to the cloud, your on-site backups run faster and your cloud backups happen automatically.
Of course, if your users almost never need to query your cold data, then it would be cheaper to store it offline than in Azure. But in most organizations, people do need to query cold data occasionally, and it would be a pain to bring that data back online every time it’s needed, so storing it in Azure is usually a better choice.
Microsoft even provides a tool called Data Migration Assistant that makes recommendations on many aspects of data migration, including Stretch Database. It tells you which databases and tables would be good candidates for Stretch Database. It also tells you about potential blocking issues, because there are a number of constraints that can prevent data from being migrated, such as tables that have more than 1,023 columns.
If you’d rather bite the bullet and just move an entire database to Azure, then there are two ways to do it. Ideally, you would move it to Azure SQL Database, which is a managed version of SQL Server. One of the big advantages is that you wouldn’t need to worry about maintenance anymore, because Microsoft takes care of all patches and updates, as well as all of the underlying infrastructure.
It isn’t always possible to move an existing SQL Server database to Azure SQL Database, though, because they’re not 100% compatible with each other. In the same Data Migration Assistant that I mentioned earlier, it tells you what reengineering, if any, would be required to move a database to the SQL Database service.
If the reengineering effort would be too great, then one option is to move the database to a SQL Server instance running on a virtual machine in Azure. The disadvantage is that you would have to take care of maintaining the instance.
So, Microsoft has an offering called SQL Managed Instance that gives you the best of both worlds. It’s nearly 100% compatible with SQL Server, but it’s managed, so you don’t have to worry about maintenance.
Microsoft provides a surprising number of options for running and scaling SQL Database. First, there are three service tiers: General Purpose, Hyperscale, and Business Critical. These service tiers have very different architectures from each other.
The General Purpose tier is the default. If you don’t have any special size, performance, or availability requirements, then this is the best and least expensive choice. Despite its budget pricing, it still offers a low 5 to 10-millisecond latency and 99.99% availability. One of its limitations, though, is database size. It can’t hold more than 4 terabytes (or 8 terabytes for a Managed Instance).
If you need more space than that, then the Hyperscale tier is the way to go. It supports databases of up to 100 terabytes. Not only that, but it can scale its compute resources up and down very quickly. It also provides instant backups and fast database restores.
If you need lightning-fast performance, then the Business Critical tier is the best choice. It provides 1 to 2-millisecond latency. It does this by using local SSD storage. Normally, it’s risky to use local storage because that makes it more difficult to provide high availability, but Microsoft has solved that problem by using a 4-node cluster. Of course, that makes it more expensive, too. You can even get higher availability than the General Purpose tier by choosing the zone redundant option, which spreads the three read replicas in the cluster across the availability zones in a region. This gives it a 99.995% availability. You can’t scale a database beyond 4 terabytes, though, so if you have a database that’s bigger than that, then you’ll still need to use the Hyperscale tier and lose the advantages of the Business Critical tier.
Okay, we’ve gone through the service tiers, but there are lots of other options as well. The next decision to make is which resource type to use. The choices are Single Database, Elastic Pool, and Managed Instance.
A single database is pretty self-explanatory. If you only need one database, then this is usually a good option.
The difference between the single database model and the elastic pool model is that an elastic pool can contain multiple databases that share resources with each other. This model works well if your databases are not all busy at the same time.
For example, suppose that you have five databases, and they all experience spikes in demand occasionally, but they don’t all experience these spikes at the same time. Putting them in a pool of shared resources would be much cheaper than provisioning five single databases because you’d have to provision enough resources for each of the single databases to handle spikes, even though they wouldn’t need that capacity most of the time. With elastic pools, you can also set minimum and maximum resources for each database to ensure that one database doesn’t consume all of the resources in the pool.
As I mentioned earlier, a Managed Instance has the highest compatibility with SQL Server. If you need to migrate an existing SQL Server database to Azure, then Managed Instance is the easiest way to do it. One disadvantage, though, is that Managed Instance doesn’t work with the Hyperscale tier, so it doesn’t support databases larger than 8 terabytes. I should also mention that elastic pools don’t work with the Hyperscale tier either.
All right, there’s yet another option you need to know: purchasing models. There are two different ways you can pay for your databases: DTUs or vCores.
Database Transaction Units, or DTUs, used to be the only option. A DTU is a balanced bundle of compute, storage, and I/O resources. The problem with this model is that it’s not very flexible. For example, if you have a workload that needs a lot of storage but not much compute power, then you’ll be paying for a lot of excess compute power that you don’t need.
Microsoft introduced the Virtual Core, or vCore, model to provide more flexibility, and it’s now the recommended purchasing model. It allows you to select the number of virtual CPU cores separately from the maximum storage space size. It also lets you specify the hardware generation, which includes options for memory-optimized and compute-optimized configurations.
And we’re not done yet because there’s another purchasing option within the vCore model. You can choose either Provisioned or Serverless. With the Provisioned model, your SQL Database instance gets provisioned with the exact resources you requested, and you get charged for the database as long it’s running.
With the Serverless purchasing model, you specify the minimum and maximum number of vCores, and it will autoscale based on workload demand. Then you can configure it so that if there’s no activity for a period of time, it will pause the database, and you won’t be charged for any compute resources until new activity causes the compute resources to come back up.
This sounds great, doesn’t it? So why wouldn’t you always choose the Serverless option? Well, the cost per vCore is higher for Serverless than it is for Provisioned, so if your database was active most of the time, then the Serverless option would be more expensive.
By the way, even if you don’t choose the Serverless option, you can still scale your database manually when you need to. You can just edit the database’s configuration and change the number of vCores (or DTUs if you’re using that model). There will be a brief loss of connectivity while it switches over to the new resources, but it generally takes less than four seconds, so it shouldn’t be very noticeable.
Regardless of which service tier you choose, high availability within a region is handled automatically. SQL Database provides HA through a technology similar to Always ON Availability Groups in SQL Server. The good news is that you don’t have to do anything, even if there is a failure, because failover is automatic too. It may take up to 30 seconds to recover, but that should be acceptable in most cases.
To protect against regional failures, you need to configure active geo-replication. This lets you create up to four read replicas in different regions. If the primary database becomes unavailable, then you (or your application) can failover to one of the secondary databases. That secondary then becomes the new primary.
To make this work, you need to ensure that your secondaries have user authentication configured in the same way as the primary. One way to do this is to use Azure Active Directory so you don’t need to manage credentials manually on your databases.
You should also use the same firewall rules for secondaries as you do for your primary. One way to do this is to use database-level firewall rules, which are replicated with the database.
Active geo-replication also has the side benefit that you can use the secondary databases to make queries faster for users in other regions. This is possible because the secondaries are read replicas, so as long as a user doesn’t need to write to the database, they can connect to the nearest secondary instead of the primary.
If you’re using a Managed Instance, then disaster recovery is handled by auto-failover groups, which is a layer on top of active geo-replication. Auto-failover groups allow you to failover multiple databases at the same time, which is a really nice feature.
Even with high availability, backups are still important, so you can recover from cases of accidental corruption or deletion. Luckily, SQL Database automatically takes care of backups too. It saves them in read-access geo-redundant storage (RA-GRS) to protect against regional failures. Transaction log backups happen about every 5 or 10 minutes, so you won’t lose much data if you do a point-in-time recovery.
Backups are saved for 7 days by default, but you can increase the retention period to up to 35 days. If you need to save them for longer than that, then you can configure a long-term backup retention policy for each database. You can keep backups for up to 10 years.
Normally, you would restore data to the same server where it was originally located, but if you need to restore a database to a different server, then you can use Geo-restore. As the name implies, this is usually used to restore a database to another geographic region. If you’re not using active geo-replication, then this is the way to failover to another region. It typically takes about 12 hours, though, so it’s not a good solution for mission-critical applications. It can also result in the loss of up to an hour’s worth of data, since it can take up to an hour for the backup copy in the primary region to be replicated to the secondary region. If the primary region goes down, then the secondary region may not have the latest backup.
And that’s it for relational database storage.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).