Introduction to the AWS Database Migration Service
The course is part of these learning pathsSee 1 more
Hi and welcome to Course Four - running a heterogeneous database migration with Database Migration Service. Following this course we will be able to recognize and explain the fundamental steps required to convert a microsoft SQL server database schema to a new database engine hosted in Amazon Web Services RDS using the AWS Schema Conversion Tool. We will learn how to review and analyse the assessment reports. We will then migrate the data from the source database engine to the new database schema using the AWS Database Migration Service
Following this lecture, we be able to recognize and explain the fundamental steps required to convert a Microsoft SQL server database schema to the MySQL database engine. To achieve this, we're going to use the AWS schema conversion tool. We will learn how to review and analyze the assistant reports provided by the tool. And we'll do a hands on demonstration of converting the schema. We will then migrate the data itself from the source database engine to our new database schema. To do that, we're going to use the AWS database migration service. The steps required to perform our migration are roughly as follows. For converting our schema using the AWS schema conversation tool, we're going to download and install the schema conversion tool. We're going to download the JDBC drivers, which stand for java database connectivity drivers for the schema conversion tool. We're then going to connect to our source database, which is our Microsoft SQL server. We're going to connect to our target database, which is our MySQL server. And then, we'll run an assessment report. Now, once we've completed that schema and assuming the schema conversion goes as we planned, then we're looking at converting our data. To do that, we're going to use the database migration service, which if you recall is something that runs from the AWS console, and runs on an EC2 instance that we provision in our AWS account. So, first of all, we need to provision that replication server. We can set some parameters the memory that's used. We can set some parameters around the storage that's required. And of course, what size machine do we want to have for this? We don't need to over provision it. We also need to make sure that it's going to be up to the task at hand. And also remembering that we're paying only for what we use. But we still are going to be incurring costs for the time that that server is active. We're going to define our source database and our target database end points. So, our source database will be our Microsoft SQL server, which has a public end point which we'll connect to. And our target endpoint will be the MySQL database, which is hosted in the relational database service in AWS. Then we're going to create one or more tasks to migrate the data between the source and the target databases. And a typical tasks consists of three phases if you remember. There's the full load of our existing data first. Then, there's the application of cased changes, i.e. those things that may have occurred during the migration itself. And then, we also have a third stage which is our ongoing replication. So, how often is that going to be running? And do we need to have an ongoing replication process? Okay, let's run our scheme conversion with the AWS schema conversion tool. So, first we need to download the conversion tool from the AWS website. And now, we need to install the application following the steps outlines on the AWS website for our particular platform. Once a tool is installed, you need to download and install the JDBC drivers for the supported database engines as well. This is because the AWS schema conversion tool uses JDBC to connect to databases. You can find the drivers on the AWS website at the link showing. For our test cases, we'll download and install two drivers. The JDBC driver for Microsoft SQL server and the JDBC driver for MySQL. Once the installation is complete, you need to tell the AWS conversion tool where to find these drivers. This is done by selecting the global settings menu option and choosing drivers from the dialog box. With the initial setup complete, we can now start a new migration project. Now, every migration has to be part of a project. In our case, we will convert the well known adventure works database to a MySQL 5.6 RDS instance. The first step is to define the project properties. In this screen, we can see that the source database can either be Oracle, Microsoft SQL server, MySQL, or Postgres. And our target database can be MySQL, Aurora, Postgres, or Oracle. Once the project is created, it's time to connect to the source instance. Here we are connecting to our SQL server 2012 instance that we have running on our local network. If we're using a source address that's hosted in the public domain, our connection might look something like this. Here we have a Microsoft SQL server running on the Amazon RDS service. Keep in mind that we need to have a public endpoint configured for this service to work. Although SQL servers have instance names and default instances typically have the same name as the Window's box that they're running on, we don't have to specify it here. One thing to notice, that the user account must be a SQL server standard log in, and not a Windows domain or local account. If the SQL server is running in Windows only authentication mode, this is not going to work, because no standard logins will be available or supported. Once the source is connected, its databases and schemas under each database are loaded in the left hand side of the screen. If you are a SQL server database administrator or developer, you may notice a couple of things straight away. A SQL server instance has many components that are usually visible from the SQL server management studio, which is the de facto client tool for SQL server. Now, these objects include things like logins and server roles, jobs, alerts, other links servers, replication publishers, and subscribers, et cetera. Even within each database, there are objects like users, partitions, full indexes, et cetera. But when you connect to the instance with the AWS schema conversion tool, you only see the databases and the objects like tables, indexes, the constraints, and store procedures are usually accessible. So, if you're wondering why this is, keep in mind that the AWS schema conversion tool is built for converting only database components, nothing else. That's why it shows database objects from the source instance only. In the SQL server management studio, similar types of components in a database are grouped together. For example, all tables from all schemas in a database are generally listed under the tables folder. The same types of folders exist for views, store procedures, or functions. You might note that even within each type of object, there can be further grouping for related objects. And for example, the node for a particular table can have all its indexes listed under one folder, and all the constraints listed in another folder. With the AWS schema conversion tool, this classification is shown in a different way. Here, each schema under a database will have its own folder, and under each schema, there is a sub folder for tables, views, store procedures, et cetera. So, in essence, even though the objects could belong to the same database, their placement may be under different schemas. Let's have a look at the side by side comparison. This means when you compare the source database components, the highest level you can convert from is the schema level, not the SQL server database level. That means each separate schema under a SQL server database will end up as a separate MySQL database in the target. Next, we need to connect to our target environment, the MySQL RDS instance. And for this, we'll provide similar information as we did for our source database. As your target database is based in your AWS virtual private cloud, you are going to need to open your security port to allow connection from the IP address that you're running your schema conversion tool from. Once connected, the target instance schema objects are loaded in the right pane of the tool. It's now time to check what components from our source databases can be converted to MySQL in the same format. For this, we select a schema, right click on it, and select create report from the pop up menu. This will fire off the schema conversion engine. The AWS schema conversion tool will do reverse engineer every object within the schema and check if a generated code can be run against the target instance without any change being required. Now, if it can, it will make a note of this finding. If it can't, it will make a note about why it's not possible. At the end, the assessment will generate a report. That report is called the database migration assessment report. There are two parts of the assessment report. The summary part and the actions item part. The summary page gives an overall picture of the migration possibility. It shows how many objects in each type of component can be converted and how many can't. The actions item page will go deeper into this analysis. It will list every object that can't be converted to the target engine, and the reason for their failure. It's very helpful that the AWS schema conversion tool highlights the particular command or syntax in the generated code that cause the problem. This code is viewable from the lower half of the actions item screen. You can see the bottom half of the action item screen shows the generated code for a particular object that can't be converted. When you select an individual problem item from a top half of the screen, it's related object is highlighted in the left pane, and its generated code is shown in the bottom. With the assessment report giving you the overall migration possibility, at this point, we've got two options. One, we can make changes to the source database structure so it matches the target database syntax. This may mean changing datatypes for table columns or rewriting store procedures or triggers, or even doing functions with different syntax. Now, in most migration scenarios, that's not going to be much of an option. We can apply the converted schema to the target instance as is, and them make manual changes to the target database by creating tables with the correct database, or write code with the correct syntax. The second option is the preferred method as it's much simpler and is preferred by most data engineers. Migrating the schema to the target environment is a two step process. First, we convert the schema. This is where the AWS schema conversion tool will generate a local script to apply to the target environment. Second, we apply the converted schema. The converted schema code is actually applied to the target database instance and the target objects are created. So, here we've chosen to create two schemas within the adventure works database, human resources and DBO. With the schemas converted to MySQL's format, the target instance will show the databases created. However, they don't actually exist yet. To physically the database objects, we have to apply the schemas from the target database pane. Once that's done, that completes the initial schema migration. And now we can look to migrate our data using the AWS database migration service. The first thing we'll do is access the database migration service from the AWS console menu. Now, remember, we need to provision an instance for the service to run on. We'll provisioning an EC2 instance that the service will use. And when we do that, we're presented with a number of options. So, let's just review what the database migration service will be doing, so we can decide the best options for your migration. Now, with the service, data tables are loaded individually. And by default, they're done eight tables at a time. So, while each table is loaded, the transactions for that table are cached in memory. Now, after the available memory's all used up, the transactions are cached to disk. And when the table for those transactions is loaded, the transactions and any further transactions on that table are immediately applied to the table. Now, when all the tables have been loaded, and all outstanding case transactions for the individual tables have been applied by database migration service, the source and target tables will be in sync. So, at that point, the database migration service is going to apply the transactions in a way that maintains as much transactional consistency as possible. So, we're gonna need quite a lot of memory basically. So, what size instance should we choose? Now, the smaller instance types can work well. But do keep in mind that the T2 type instance is designed to provide moderate baseline performance and the capability to burst to significantly higher performance. So, a lot of database migration service processing occurs in memory. Now, if your instance is too small, that could become the bottleneck for your migration. And a bottleneck can make a huge difference to how long your migration takes. So, T2 instances are intended for workloads that don't use the full CPU cycle often or consistently, but they occasionally need to burst to higher CPU performance. So, T2 instances are really well suited for general purpose work loads, such as web servers, developer environments, and smaller databases. For migrations that contain a lot of large objects, you may want to choose a larger instance size to ensure that you have adequate CPU. Now, keep in mind too that we're paying for this in a pay as you go model. This is a EC2 instance like any other that you'll spit out. So, of course it's going to have an hourly charge. If the migration takes a lot longer due to a bottleneck, then you're gonna be paying more. So, it's probably better to consider having a larger instance type rather than run the risk of ending up with a CPU bound migration that takes twice as long. Now, if you trouble shooting a slow migration, look at the CPU utilization host metric to see if you are bursting over the baseline for your instance type. Under the advance tab there's a number of options. One is the amount of storage we want to use. So, depending on the instance class you choose, your replication server's gonna come with either 50 gigabytes or a 100 gigabytes with data storage. Now, that storage is used for the log files and for any case changes that are collected during a load that can't be done on memory. So, if your source system is busy, or if you're running multiple tasks on that replication server, which is quite possible, you might need to increase the amount of storage you have attached to it. Now, all storage volumes in the database migration service are GP2 or general purpose SSD. Now, the GP2 volumes come with a base performance of three IO operations per second or IOPS. With the ability to burst up to 3,000 IOPS on a created bases. So, as a rule of thumb, check the read IOPS and write IOPS metrics for the replication instance. And just be sure that the sum of these metrics does not cross the base performance for that volume. And again, you cloud watch metrics can give you some insight into how your IOPS is performing. There's relatively little disk space that's required to hold the case transactions. And the amount of disk space used for a given migration is gonna depend on a few things. First of all, how large are your tables? Large tables are gonna take longer to load. And so, transactions in those tables are going to be cached until a full table is loaded. Once a table's loaded, these cached transactions are applied and they're no longer held on disk. The other consideration is the data manipulation language. So, a busy database generates more transactions. So, these transactions must be cached until a table is loaded. Remember that transactions in an individual table are applied as soon as possible after the table is loaded. And until all tables are loaded. So, at that point DMS applies all the transactions. The other consideration is transaction size. So, data generated by large transactions is going to need to be cached. If a table accumulates 10 gigabytes of transactions during a full load process, those transactions will need to be cached until a full load is complete. The other consideration is the total size of the migration. Large migrations are gonna take longer and the log files that are generated will be larger. And another one to think about is the number of tasks. So, the more you have, the more caching that's likely to be required. And so the more log files that will be generated. What you'll find is that it's the log files that consume the more amount of storage. It's unlikely that you're going to see what's provisioned under those 50 or a hundred gigabytes defacto storage sizes. However, like we said earlier, every migration is unique. So, you might find that you need a bit more. Now, the other option is whether we run our migration instance as a single or multi-AZ instance. Now, if we select a multi-AZ instance, that's gonna protect our migration from any storage failures. And most migrations are transient and not intended to run for long periods of time. But if you're using the data base migration service for an ongoing replication process, then selecting the multi-AZ instance is gonna give you more availability should as storage as you occur. If you're running a proof of concept or an assessment only, then a single instance is most likely gonna be enough. Another option we have is using encryption. So, we can use the Amazon and key encryption service to encrypt our data at risk. And using the KMS, the Amazon key management service is a very simple way for us to encrypt that data so that we get an extra layer of protection. The KMS master key option enables us to set the CMK key, which stands for "customer master key", that will be used to protect our data if we decide to encrypt our database volume. So, we can choose to either use our default key or we can have our own or customer managed CMK key created using the encryption key's tab inside the IAM console. Next, we need to identify which virtual private cloud we want the replication instance to run in. And the security group we want to apply to that instance. Now, if you're not overly familiar with the AWS virtual private cloud, you can select the default virtual private cloud or VPC. Now, that is pre-configured with each new AWS account. And using a default VPC also means you can use the default sub-net security group and default internet gateway. If you have configured your own VPC, ensure you have an internet gateway with a route to the public domain and that you have configured in and out egress for the ports used by your source and target databases. Our source database is Microsoft SQL server, which runs on port 1433. I have therefore defined network access rules and a security group with inbound and outbound rules for those particular database ports to allow in and outbound traffic on port 1433 to and from the IP address of my replication server instance. To get the IP address of the replication instance, you do need to dig around a little bit. The details of the replication server are located in the replication instances tab in the database migration service menu. With the replication instance setup, we now need to add and configure our source database and our target database. These will most likely be the database endpoints you used with the schema conversion tool. However, if either database is running locally, you will need to create and configure a public endpoint for that database. While the database schema conversion tool ran on your local computer, the database migration service runs IN your Amazon web services virtual private cloud, which is in the public domain. To setup our source database, first we need to give it an identifier. So, we type a name that we want to use. You may want to include the name of the type of endpoint such as MySQL server source or whatever you want to use. The name must be unique for each replication instance. We would then need to choose the engine for our source database. We choose the type of database engine, Microsoft SQL server and then we need to give it the server name. So, for on premise databases, this might be an IP address or the public host name depending on how you've setup the routing into your network. If you're using an Amazon relational database service database instance, then this will be the endpoint which is also know as a DNS name for that DBS instance. And it will be configured similar to the one you'll see on the screen here. We'd also need to specify the port, and because we're using a Microsoft SQL server for our source, our port number will be 1433. We have the option of setting the secure socket layer mode as well, or SSL. If we want to have encryption of our transport, then we need to select which mode we want. We then need to give the user name for the database instance. Now, this needs to have the permissions required to allow data migration. We also need to provide the password. If we're using Microsoft SQL server, then our login needs to be a Microsoft SQL server login. We can't use Windows authentication at this stage. So, we need to provide a user name and password for a user that has the authority to be able to create a full database backup and to access all of the relevant tables. Okay, once we've got our source and target database endpoints setup, we can now go ahead and setup a task. Now, remember, you can always go back and reconfigure your replication instance from the database migration service menu. You can change any of the parameters and also change the endpoint configuration for that particular instance. Now, do remember to comeback and delete the replication instance after you've finished. Because I just noticed it's quite easy to have that replication instance sitting there running without noticing that you've got to actually get rid of it once you've finished your task. Anyway, let's get into tasks. So, now that we've created that, we can setup a replication task that's relatively simple. Remember, we have three types of task. We've got the full load of data. We've got the application of cached changes, and we've got ongoing replication. Now, this is an introductory lecture. So, it's a beyond the scope to each of these tasks in great detail. Let's walk through the options we have for one of our full load of data tasks. So, you get an idea of what options you have available. The first thing we do is give it a task name, and that needs to be a unique name. We can set their replication instance. So, if we do have more than one instance, then we need to select the correct instance to use. We select our source endpoint, our target endpoint, and we select our migration type. So, the three options we have our migrate existing data which is the first option. The second is to migrate existing data and replicate on-going changes. And the third is to replicate any data changes only. We have a number of options under the task setting tab. So, we have under preparation we can do nothing, it's with the data and meta data of the target of the tables and not changed. If we opt to drop the tables and the target, then the tables are dropped and new tables are created in their place. If we truncate the tables, then they're truncated without affecting the table meta data. Now, one thing to keep in mind is that the large object support is quite crucial that we set this right. Large objects can take a long time to copy over. Large object column options we have that we can include the large objects in the migration. We can opt to not include them. So, they'll be excluded from the migration itself. We can sit a full large object mode, which migrates the complete large object regardless of its size. So, you can imagine that could be quite timely if you have quite a few of them. The large objects are migrated piece wise or in pieces, and the chunks are controlled by the large object chunk size. Now, that method is much slower than using the limited large object mode, which only takes a certain size of a large object. And that's the third option there. That basically truncates the large object to a maximum large object's size that you can set in the below there. That's gonna be a lot faster. So, if you set that to be a smaller size, then the migration will happen much quicker. And of course, we have the option to enable login. Now, under the advanced settings, we've got a few extra switches that we can set for each of the three migration types. We can set a create control table, which is the schema on the target database where the database migration service can create its control tables. So, those can be deleted once you've finished. We can set the duration of each history record so that we can limit the number of recorded rows we have in our log. And we can set what exceptions we do capture, whether we apply all, whether we set a status as well, or we suspend tables, or set the replication history. For tuning settings, we can set the maximum of tables to load. Remember, the default number is eight at a time. So, we can set it to be higher or lower if we want. And that may be something we consider to tune our migration if we do find it's taking too long. We can also set the transaction consistency timeout value. So, if we have a slow connection or intermittent connection performance, then making this higher can reduce the amount of timeout errors we might receive. And we can also as a tune, set the maximum number of events that can be transferred in one goal. So, the default is 10,000. We can set this higher if want to try and run more concurrent load at any one time. Again, it's gonna come down to what type of bandwidth we've got and what type of performance we can expect on our replication instance. We can set how the database migration service handles table names. And if we don't want the schema to migrate any particular names, we can also actually view the adjacent file for how that method will be handled. So, the default is that it's gonna migrate all source tables to target tables of the same name if they're already there. Otherwise, it's gonna create a source table on the target. So, we can set it to override this behavior and if we want to limit the number of individual selected tables in the migration. And you're able to configure your own adjacent script inside that. Okay, so that's the basis of a task. And of course, we would just need to run that. One thing I do recommend doing before you run a task is just to check that your source and target database endpoints are active. Now, you can do that from inside the console. There's a simple test there. And if there are issues, that gives you an opportunity to go and fix that connectivity. Most common problems you'll have at this point if you do strike an error. Your replication server or access to your source or target databases that might be causing an issue if you strike one at that point. Alright, once our source and target databases are accessible, we can create a task name. Keep it simple. We choose the replication instance. We choose our endpoint and our target. We choose our migration type. And we choose the data from the schema that's been populated from our source database. So, we can only choose one table at a time during this. So, we'll choose the human resources table set. Okay, so once we click create task, it defaults to start once the task is being created. We can see in the console there that it's creating the task and below is viewable the method and the task settings. So, when you're debugging, it's quite useful to have this view because you can see where things might not have gone right. And because we turned on logging, we can check back and see what log errors have been created at any one time. Now, it's likely you'll get a few errors intermissions on your first two or three runs. The task monitoring is fantastic. Of course, since it's cloud watch, you'll have to wait a few minutes before you'll get any data populated in there. But when the task is running it's very useful. I certainly found it useful for debugging a filed load earlier. So, once this task actually is ready to go and has started, then the table statistics will show us more detail on the actual rows that are being copied, deleted, or updated. Okay, so, our task is now ready. So, we will find it's about to kick off. Here we go, it's starting. So, now we should be able to log some detail in the console here. Takes a couple of seconds to refresh, but it's fantastic. It's a dynamic responsive interface. There we go. So, that's our first view on our stats. And it gives us a full breakdown of what inserts, deletes, updates, DDLs, and full load rows have been completed. Again, the log is the single source of truth for everything that goes on or doesn't go on. So, going back into the cloud watch log is really, really powerful. We can do that while the task is running. We get some good metrics about how many of the tables have been loaded. What type of performance we're getting out of our replication instance. So, I highly recommend looking at this. If it looks like you're getting a bottleneck around CPU performance, then you might want to step up the size of your instance and rerun your migration. In this instance, it's completed. That's fantastic. So, we can now evaluate the actual selections and to see whether the data we wanted was copied. So, in our table summary, we've had six tables loaded, which is a success. Okay, I just want to share with you a few of the troubleshooting things I came across when I was running my migrations. The first one was just around connectivity. Most of the connectivity problems I encountered came back to security groups or network access control lists. So, make sure that you set the ingress and egress rules for your endpoint. So, you need to make sure that your replication instance can connect to your endpoint. And you need to make sure that your endpoint is able to do the same. So, you need to actually set specific inbound and outbound rules. You also need to ensure that you have the correct roles setup in the IAM user account that you're using to create your replication instance and your endpoints. So, while the documentation says that the DMS service creates these roles for you, it's well worth checking that you do. Especially, if you're not running your account as route which you shouldn't be. So, just by all means, if you're having a number of errors that just don't seem to make any sense, go back to the IAM console and check that you have this policy. Okay, that concludes our migration lecture. Thanks for your attention.
About the Author
Head of Content
Andrew is an AWS certified professional who is passionate about helping others learn how to use and gain benefit from AWS technologies. Andrew has worked for AWS and for AWS technology partners Ooyala and Adobe. His favorite Amazon leadership principle is "Customer Obsession" as everything AWS starts with the customer. Passions around work are cycling and surfing, and having a laugh about the lessons learnt trying to launch two daughters and a few start ups.